Sr. No. Questions A B C D Ans: Unit ONE SUB: 410243 DA

Uploaded by

Harshal Chaudhary

UNIT ONE    SUB : 410243 DA
Sr. No. Questions a b c d Ans

1 Business intelligence (BI) is a broad category of application programs which includes _____________
a) Decision support b) Data mining c) OLAP d) All of the mentioned
Ans: d

2 BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
Ans: d

3 Which of the following areas are affected by BI?
a) Revenue b) CRM c) Sales d) All of the mentioned
Ans: b

4 ________ is a performance management tool that recapitulates an organization’s performance from several standpoints on a single page.
a) Balanced Scorecard b) Data Cube c) Dashboard d) All of the mentioned
Ans: a

5 __________ is a system where operations like data extraction, transformation and loading are executed.
a) Data staging b) Data integration c) ETL d) None of the mentioned
Ans: a

6 _________ is a category of applications and technologies for presenting and analyzing corporate and external data.
a) Data warehouse b) MIS c) EIS d) All of the mentioned
Ans: c

7 Which of the following is the process of basing an organization’s actions and decisions on actual measured results of performance?
a) Institutional performance management b) Gap analysis c) Slice and Dice d) None of the mentioned
Ans: a

8 Which of the following does not form part of BI Stack in SQL Server?
a) SSRS b) SSIS c) SSAS d) OBIEE
Ans: d

9 BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
Ans: d

10 This is an approach to selling goods and services in which a prospect explicitly agrees in advance to receive marketing information.
A. customer managed relationship B. data mining C. permission marketing D. one-to-one marketing
Ans: c

11 In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences.
a. Web services b. customer-facing c. client/server d. personalization
Ans: d

12 This is the processing of data about customers and their relationship with the enterprise in order to improve the enterprise’s future sales and service and lower cost.
a. clickstream analysis b. database marketing c. customer relationship management d. CRM analytics
Ans: d

13 This is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.
a. best practice b. data mart c. business information warehouse d. business intelligence
Ans: d

14 This is a systematic approach to the gathering, consolidation, and processing of consumer data (both for customers and potential customers) that is maintained in a company’s databases.
a. database marketing b. marketing encyclopedia c. application integration d. service oriented integration
Ans: a

15 This is an arrangement in which a company outsources some or all of its customer relationship management functions to an application service provider (ASP).
a. spend management b. supplier relationship management c. hosted CRM d. Customer Information Control System
Ans: c

16 This is an XML-based metalanguage developed by the Business Process Management Initiative (BPMI) as a means of modeling business processes, much as XML is, itself, a metalanguage with the ability to model enterprise data.
a. BizTalk b. BPML c. e-biz d. ebXML
Ans: b

17 This is a central point in an enterprise from which all customer contacts are managed.
a. contact center b. help system c. multichannel marketing d. call center
Ans: a

18 This is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests, spending habits, and so on.
a. customer service chat b. customer managed relationship c. customer life cycle d. customer segmentation
Ans: d

19 In data mining, this is a technique used to predict future behavior and anticipate the consequences of change.
a. predictive technology b. disaster recovery c. phase change d. predictive modeling
Ans: d

20 According to analysts, for what can traditional IT systems provide a foundation when they’re integrated with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
Ans: a

21 All of the following accurately describe Hadoop, EXCEPT:
a) Open source b) Real-time c) Java-based d) Distributed computing approach
Ans: b

22 __________ has the world’s largest Hadoop cluster.
a) Apple b) Datamatics c) Facebook d) None of the mentioned
Ans: c

23 What are the five V’s of Big Data?
a) Volume b) Velocity c) Variety d) All of the above
Ans: d

24 _________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading.
a) Scalding b) Cascalog c) Hcatalog d) Hcalding
Ans: b

25 What are the main components of Big Data?
a) MapReduce b) HDFS c) YARN d) All of these
Ans: d

26 What are the different features of Big Data Analytics?
a) Open-Source b) Scalability c) Data Recovery d) All the above
Ans: d

27 Define the Port Numbers for NameNode, Task Tracker and Job Tracker.
a) NameNode b) Task Tracker c) Job Tracker d) All of the above
Ans: d

28 Facebook Tackles Big Data With _______ based on Hadoop
a) Project Prism b) Prism c) ProjectData d) ProjectBid
Ans: a

29 What is a unit of data that flows through a Flume agent?
a) Record b) Event c) Row d) Log
Ans: b

30 A feature F1 can take certain value: A, B, C, D, E, & F and represents grade of students from a college. Which of the following statement is true in the following case?
a) Feature F1 is an example of nominal variable.
b) Feature F1 is an example of ordinal variable.
c) It doesn’t belong to any of the above category.
d) Both of these
Ans: b

31 Which of the following is an example of a deterministic algorithm?
a) PCA b) K-Means c) None of the above d) All of the above
Ans: a

32 What is the entropy of the target variable?
a) -(5/8 log(5/8) + 3/8 log(3/8))
b) 5/8 log(5/8) + 3/8 log(3/8)
c) 5/8 log(5/8) + 3/8 log(3/8)
d) 5/8 log(3/8) – 3/8 log(5/8)
Ans: a
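
Option (a) is the Shannon entropy of a two-class target whose classes occur with proportions 5/8 and 3/8. A minimal sketch of that calculation follows; it assumes a base-2 logarithm (the question does not state the base), and the helper name `entropy` is hypothetical.

```python
import math


def entropy(proportions, base=2):
    """Shannon entropy of a discrete distribution given as class proportions."""
    # Classes with proportion 0 contribute nothing, so skip them to avoid log(0).
    return -sum(p * math.log(p, base) for p in proportions if p > 0)


# Class proportions taken from option (a): 5 of 8 and 3 of 8 examples.
h = entropy([5 / 8, 3 / 8])
print(round(h, 3))
```

Under the base-2 assumption the result is roughly 0.954 bits; a different logarithm base only rescales the value.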

33 Point out the correct statement.
a) OLAP is an umbrella term that refers to an assortment of software applications for analyzing an organization’s raw data for intelligent decision making
b) Business intelligence equips enterprises to gain business advantage from data
c) BI makes an organization agile thereby giving it a lower edge in today’s evolving market condition
d) None of the mentioned
Ans: b

34 BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
Ans: d

35 Which of the following areas are affected by BI?
a) Revenue b) CRM c) Sales d) All of the mentioned
Ans: b

36 Which of the following does not form part of BI Stack in SQL Server?
a) SSRS b) SSIS c) SSAS d) OBIEE
Ans: d

37 BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
Ans: d

38 Heuristic is
a) A set of databases from different vendors, possibly using different database paradigms
b) An approach to a problem that is not guaranteed to work but performs well in most cases
c) Information that is hidden in a database and that cannot be recovered by a simple SQL query.
d) None of these
Ans: b

39 In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences.
a. Web services b. customer-facing c. client/server d. personalization
Ans: d

40 Heterogeneous databases refers to
a) A set of databases from different vendors, possibly using different database paradigms
b) An approach to a problem that is not guaranteed to work but performs well in most cases.
c) Information that is hidden in a database and that cannot be recovered by a simple SQL query.
d) None of these
Ans: a

UNIT TWO    SUB : 410243 DA
Sr. No. Questions a b c d Ans
1 Movie Recommendation systems are an example of:
a) Classification b) Clustering c) Reinforcement Learning d) Regression
Ans: b,c

2 Sentiment Analysis is an example of:
a) Regression b) Classification c) Clustering d) Reinforcement Learning
Ans: a,b,d

3 What is the minimum no. of variables/features required to perform clustering?
a) 0 b) 1 c) 2 d) 3
Ans: b

4 Is it possible that assignment of observations to clusters does not change between successive iterations in K-Means?
a) Yes b) No c) Can’t say d) None of these
Ans: a

5 Which of the following can act as possible termination conditions in K-Means?
a) For a fixed number of iterations.
b) Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum.
c) Centroids do not change between successive iterations.
d) Terminate when RSS falls below a threshold.
Ans: a,b,c,d
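
The four termination conditions listed in question 5 can be seen side by side in a small implementation. This is a minimal sketch of Lloyd's algorithm on 2-D points in plain Python; the function name `kmeans`, its parameters, and the returned reason strings are assumptions made for illustration, not part of any library.

```python
def kmeans(points, centroids, max_iter=100, rss_threshold=0.0):
    """Lloyd's algorithm on 2-D points; returns (centroids, labels, stop_reason)."""
    labels = None
    for _ in range(max_iter):                        # (a) fixed number of iterations
        # Assign each point to its nearest centroid (squared Euclidean distance).
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                              + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:                     # (b) assignments did not change
            return centroids, labels, "assignments unchanged"
        labels = new_labels

        # Recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(old)            # leave an empty cluster in place
        if new_centroids == centroids:               # (c) centroids did not change
            return centroids, labels, "centroids unchanged"
        centroids = new_centroids

        # Residual sum of squares between points and their assigned centroids.
        rss = sum((p[0] - centroids[l][0]) ** 2 + (p[1] - centroids[l][1]) ** 2
                  for p, l in zip(points, labels))
        if rss < rss_threshold:                      # (d) RSS below a threshold
            return centroids, labels, "rss below threshold"
    return centroids, labels, "max iterations reached"


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
final_centroids, final_labels, reason = kmeans(points, [(0, 0), (10, 10)])
print(final_centroids, final_labels, reason)
```

In practice several of these checks are usually combined, with the iteration cap acting as a safety net.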
6 Which of the following clustering algorithms suffers from the problem of convergence at local optima?
a) K-Means clustering algorithm
b) Agglomerative clustering algorithm
c) Expectation-Maximization clustering algorithm
d) Diverse clustering algorithm
Ans: a,c

7 Which of the following algorithm is most sensitive to outliers?
a) K-means clustering algorithm
b) K-medians clustering algorithm
c) K-modes clustering algorithm
d) K-medoids clustering algorithm
Ans: a

8 How can Clustering (Unsupervised Learning) be used to improve the accuracy of Linear Regression model (Supervised Learning):
a) Creating different models for different cluster groups.
b) Creating an input feature for cluster ids as an ordinal variable.
c) Creating an input feature for cluster centroids as a continuous variable.
d) Creating an input feature for cluster size as a continuous variable.
Ans: a,b,c,d

9 What could be the possible reason(s) for producing two different dendrograms using agglomerative clustering algorithm for the same dataset?
a) Proximity function used
b) No. of data points used
c) No. of variables used
d) All of the above
Ans: d

10 In which of the following cases will K-Means clustering fail to give good results?
a) Data points with outliers
b) Data points with different densities
c) Data points with round shapes
d) Data points with non-convex shapes
Ans: a,b,d

11 Which of the following is/are valid iterative strategy for treating missing values before clustering analysis?
a) Imputation with mean
b) Nearest Neighbor assignment
c) Imputation with Expectation Maximization algorithm
d) All of the above
Ans: c

12 Feature scaling is an important step before applying K-Mean algorithm. What is the reason behind this?
a) In distance calculation it will give the same weights for all features
b) You always get the same clusters, whether or not you use feature scaling
c) In Manhattan distance it is an important step but in Euclidean it is not
d) None of these
Ans: a

13 Which of the following method is used for finding the optimal number of clusters in K-Mean algorithm?
a) Elbow method b) Manhattan method c) Euclidean method d) All of the above
Ans: a

14 What is true about K-Mean Clustering?
a) K-means is extremely sensitive to cluster center initializations
b) Bad initialization can lead to poor convergence speed
c) Bad initialization can lead to bad overall clustering
d) None of these
Ans: d

15 Which of the following can be applied to get good results for K-means algorithm corresponding to global minima?
a) Try to run algorithm for different centroid initialization
b) Adjust number of iterations
c) Find out the optimal number of clusters
d) None of these
Ans: a,b,c

16 If you are using Multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important:
a) All the data points follow two Gaussian distribution
b) All the data points follow n Gaussian distribution (n > 2)
c) All the data points follow two multinomial distribution
d) All the data points follow n multinomial distribution (n > 2)
Ans: c

17 Which of the following is/are not true about Centroid based K-Means clustering algorithm and Distribution based expectation-maximization clustering algorithm:
a) Both starts with random initializations
b) Both are iterative algorithms
c) Both have strong assumptions that the data points must fulfill
d) Expectation maximization algorithm is a special case of K-Means
Ans: d

18 Which of the following is/are not true about DBSCAN clustering algorithm:
a) For data points to be in a cluster, they must be in a distance threshold to a core point
b) It has strong assumptions for the distribution of data points in dataspace
c) It has substantially high time complexity of order O(n^3)
d) It does not require prior knowledge of the no. of desired clusters
Ans: b,c

19 Which of the following are the high and low bounds for the existence of F-Score?
a) [0,1] b) (0,1) c) [-1,1] d) None of the above
Ans: a

20 All of the following increase the width of a confidence interval except:
a. Increased confidence level b. Increased variability c. Increased sample size d. Decreased sample size
Ans: c
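
The direction of each effect in question 20 can be checked numerically. The sketch below assumes the simplest case, a z-interval for a mean with known standard deviation, whose width is 2·z·σ/√n; the helper name `ci_width` and the sample numbers are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(confidence, sd, n):
    """Width of a two-sided z-interval for a mean: 2 * z * sd / sqrt(n)."""
    # Critical value for the requested two-sided confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sd / sqrt(n)


base = ci_width(0.95, sd=10, n=100)
print(ci_width(0.99, sd=10, n=100) > base)   # higher confidence level: wider
print(ci_width(0.95, sd=20, n=100) > base)   # more variability: wider
print(ci_width(0.95, sd=10, n=400) < base)   # larger sample: narrower
print(ci_width(0.95, sd=10, n=25) > base)    # smaller sample: wider
```

Only the larger sample size narrows the interval, which matches answer (c).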

21 The p-value in hypothesis testing represents which of the following: Please select the best answer of those provided below.
a. The probability of failing to reject the null hypothesis, given the observed results
b. The probability that the null hypothesis is true, given the observed results
c. The probability that the observed results are statistically significant, given that the null hypothesis is true
d. The probability of observing results as extreme or more extreme than currently observed, given that the null hypothesis is true
Ans: d

22 Assume that the difference between the observed, paired sample values is defined in the same manner and that the specified significance level is the same for both hypothesis tests. Using the same data, the statement that “a paired/dependent two sample t-test is equivalent to a one sample t-test on the paired differences, resulting in the same test statistic, same p-value, and same conclusion” is: Please select the best answer of those provided below.
a. Always True b. Never True c. Sometimes True d. Not Enough Information
Ans: a

23 Green sea turtles have normally distributed weights, measured in kilograms, with a mean of 134.5 and a variance of 49.0. A particular green sea turtle’s weight has a z-score of -2.4. What is the weight of this green sea turtle? Round to the nearest whole number.
a. 17 kg b. 151 kg c. 118 kg d. 252 kg
Ans: c
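
The answer to question 23 follows from inverting the z-score formula, x = mean + z · sd, where the standard deviation is the square root of the variance. A short worked sketch using the figures from the question (the variable names are illustrative):

```python
from math import sqrt

mean_kg = 134.5
variance = 49.0
z_score = -2.4

# The z-score is defined with the standard deviation, not the variance.
sd_kg = sqrt(variance)

# Invert z = (x - mean) / sd to recover the raw weight.
weight_kg = mean_kg + z_score * sd_kg
print(round(weight_kg))
```

A common slip is to multiply the z-score by the variance (49) instead of the standard deviation (7), which gives option (a).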
24 What percentage of measurements in a dataset fall above the median?
a. 49% b. 50% c. 51% d. Cannot Be Determined
Ans: d

25 The proportion of variation in 5k race times that can be explained by the variation in the age of competitive male runners was approximately 0.663. What is the value of the sample linear correlation coefficient? Round to 3 decimal places.
a. 0.663 b. 0.814 c. -0.814 d. 0.440
Ans: c
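
In simple linear regression the proportion of explained variation is r², so the magnitude of the correlation coefficient in question 25 is the square root of 0.663. The sign is not recoverable from r² alone; it matches the sign of the regression slope, which is not reproduced in this excerpt and is therefore passed in as an assumption below. The helper name `correlation_from_r2` is hypothetical.

```python
from math import copysign, sqrt


def correlation_from_r2(r_squared, slope_sign):
    """Recover r from r^2, taking its sign from the regression slope."""
    # sqrt gives the magnitude; copysign attaches the slope's sign.
    return copysign(sqrt(r_squared), slope_sign)


print(round(correlation_from_r2(0.663, -1), 3))  # negative slope assumed
print(round(correlation_from_r2(0.663, +1), 3))  # positive slope assumed
```

Options (b) and (c) differ only in that sign, so the choice between them depends on the slope reported with the original results.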
26 Using all of the results provided, is it reasonable to predict the 5k race time (minutes) of a competitive male runner 73 years of age?
a. Yes; linear correlation between age and 5k race times is statistically significant
b. Yes; both the sample linear regression equation and an age in years is provided
c. No; linear correlation between age and 5k race times is not statistically significant
d. No; the age provided is beyond the scope of our available sample data
Ans: d

27 Algorithm is
a) It uses machine-learning techniques. Here program can learn from past experience and adapt themselves to new situations
b) Computational procedure that takes some value as input and produces some value as output
c) Science of making machines performs tasks that would require intelligence when performed by humans
d) None of these
Ans: b

28 Bias is
a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory
b) Any mechanism employed by a learning system to constrain the search space of a hypothesis
c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation.
d) None of these
Ans: b

29 Classification is
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy, of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
Ans: a

30 Binary attribute is
a) This takes only two values. In general, these values will be 0 and 1 and they can be coded as one bit
b) The natural environment of a certain species
c) Systems that can be used without knowledge of internal operations
d) None of these
Ans: a

31 Classification accuracy is
a) A subdivision of a set of examples into a number of classes
b) Measure of the accuracy, of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
Ans: b

32 Cluster is
a) Group of similar objects that differ significantly from other objects
b) Operations on a database to transform or simplify data in order to prepare it for a machine-learning algorithm
c) Symbolic representation of facts or ideas from which information can potentially be extracted
d) None of these
Ans: a

33 A definition of a concept is ----- if it recognizes all the instances of that concept
a) Complete b) Consistent c) Constant d) None of these
Ans: a

34 A definition or a concept is ------------- if it classifies any examples as coming within the concept
a) Complete b) Consistent c) Constant d) None of these
Ans: b

35 Data selection is
a) The actual discovery phase of a knowledge discovery process
b) The stage of selecting the right data for a KDD process
c) A subject-oriented integrated time variant non-volatile collection of data in support of management
d) None of these
Ans: b

36 Classification task referred to
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy, of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
Ans: c

37 Hybrid is
a) Combining different types of method or information
b) Approach to the design of learning algorithms that is structured along the lines of the theory of evolution.
c) Decision support systems that contain an information base filled with the knowledge of an expert formulated in terms of if-then rules.
d) None of these
Ans: a

38 Discovery is
a) It is hidden within a database and can only be recovered if one is given certain clues (an example is encrypted information).
b) The process of extracting implicit, previously unknown and potentially useful information from data
c) An extremely complex molecule that occurs in human chromosomes and that carries genetic information in the form of genes.
d) None of these
Ans: b

39 What could be the possible reason(s) for producing two different dendrograms using agglomerative clustering algorithm for the same dataset?
a) Proximity function used
b) No. of data points used
c) No. of variables used
d) All of the above
Ans: d

40 Is it possible that assignment of observations to clusters does not change between successive iterations in K-Means?
a) Yes b) No c) Can’t say d) None of these
Ans: a

UNIT THREE    SUB : 410243 DA
Sr. No. Questions a b c d Ans

1 This clustering algorithm terminates when mean values computed for the current iteration of the algorithm are identical to the computed mean values for the previous iteration
a) K-Means clustering b) conceptual clustering c) expectation maximization d) agglomerative clustering
Ans: a

2 The correlation coefficient for two real-valued attributes is –0.85. What does this value tell you?
a) The attributes are not linearly related.
b) As the value of one attribute decreases the value of the second attribute increases.
c) As the value of one attribute increases the value of the second attribute also increases.
d) The attributes show a linear relationship
Ans: b

3 Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional probability that
a) Y is false when X is known to be false.
b) Y is true when X is known to be true.
c) X is true when Y is known to be true
d) X is false when Y is known to be false.
Ans: b
4 Chameleon is
a) Density based clustering algorithm
b) Partitioning based algorithm
c) Model based algorithm
d) Hierarchical clustering algorithm
Ans: d

5 Find odd man out
a) DBSCAN b) K-Mean c) PAM d) None of above
Ans: a

6 The number of iterations in apriori ___________
a) increases with the size of the data
b) decreases with the increase in size of the data
c) increases with the size of the maximum frequent set
d) decreases with increase in size of the maximum frequent set
Ans: c

7 Which of the following are interestingness measures for association rules?
a) Recall b) Lift c) Accuracy d) All of Above
Ans: b
8 Given a frequent itemset L, if |L| = k, then there are
a) 2^k – 1 candidate association rules
b) 2^k candidate association rules
c) 2^k – 2 candidate association rules
d) 2^(k–2) candidate association rules
Ans: c
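
The count in question 8 comes from choosing any non-empty proper subset of L as the antecedent, with the remaining items forming the consequent: there are 2^k subsets in total, minus the empty set and L itself. A small enumeration makes this concrete; the function name `candidate_rules` is an illustrative assumption.

```python
from itertools import combinations


def candidate_rules(itemset):
    """All rules antecedent -> consequent that split a frequent itemset in two."""
    items = sorted(itemset)
    rules = []
    # Antecedent sizes 1 .. k-1 exclude the empty set and the full itemset.
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


for k_items in ({"A", "B", "C"}, {"A", "B", "C", "D"}):
    rules = candidate_rules(k_items)
    print(len(k_items), len(rules), 2 ** len(k_items) - 2)
```

Each candidate would still have to pass the minimum-confidence check before being reported as a rule.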
9 _________ is an example for case based-learning
a) Decision trees b) Neural networks c) Genetic algorithm d) K-nearest neighbor
Ans: d

10 The average positive difference between computed and desired outcome values.
a) mean positive error b) mean squared error c) mean absolute error d) root mean squared error
Ans: c

11 Frequent item sets is
a) Superset of only closed frequent item sets
b) Superset of only maximal frequent item sets
c) Subset of maximal frequent item sets
d) Superset of both closed frequent item sets and maximal frequent item sets
Ans: d
12 Assume that we have a dataset containing information about 200 individuals. A supervised data mining session has discovered the following rule:
IF age < 30 & credit card insurance = yes THEN life insurance = yes
Rule Accuracy: 70% and Rule Coverage: 63%
How many individuals in the class life insurance = no have credit card insurance and are less than 30 years old?
a) 63 b) 38 c) 40 d) 89
Ans: b
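
One reading of question 12 that reproduces the listed answer treats rule coverage as the fraction of all 200 individuals who satisfy the rule's antecedent, and rule accuracy as the fraction of those covered individuals for whom the consequent holds. Both definitions are assumptions here, since the question does not spell them out; the variable names are illustrative.

```python
total_individuals = 200
rule_coverage = 0.63   # assumed: share of all individuals matching the IF part
rule_accuracy = 0.70   # assumed: share of covered individuals with life insurance = yes

# Individuals under 30 who have credit card insurance.
covered = total_individuals * rule_coverage

# Covered individuals for whom the rule's conclusion is wrong (life insurance = no).
covered_without_life_insurance = covered * (1 - rule_accuracy)
print(round(covered), round(covered_without_life_insurance))
```

Under these assumptions 126 individuals match the antecedent and about 38 of them fall in the class life insurance = no.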
13 Which of the following is cluster analysis?
a) Simple segmentation b) Grouping similar objects c) Labeled classification d) Query results grouping
Ans: b
14 A good clustering method will produce high quality clusters with
a) high inter class similarity b) high intra class similarity c) low intra class similarity d) None of above
Ans: b
15 Which two parameters are needed for DBSCAN
a) Min threshold b) Min points and eps c) Min sup and min confidence d) Number of centroids
Ans: b

16 Which statement is true about neural network and linear regression models?
a) Both techniques build models whose output is determined by a linear sum of weighted input attribute values.
b) The output of both models is a categorical attribute value.
c) Both models require numeric attributes to range between 0 and 1.
d) Both models require input attributes to be numeric.
Ans: d
17 In Apriori algorithm, if 1 item-sets are 100, then the number of candidate 2 item-sets are
a) 100 b) 200 c) 4950 d) 5000
Ans: c
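
The figure in question 17 is the number of unordered pairs that can be formed from 100 frequent 1-itemsets, C(100, 2) = 100 · 99 / 2. The sketch below counts them directly and also by enumeration; the item names are placeholders.

```python
from itertools import combinations
from math import comb

frequent_1_itemsets = [f"item{i}" for i in range(100)]  # placeholder item names

# Apriori's join step pairs every two frequent 1-itemsets into a candidate 2-itemset.
candidate_2_itemsets = list(combinations(frequent_1_itemsets, 2))

print(len(candidate_2_itemsets), comb(100, 2))
```

At this level no candidate can be pruned by the Apriori property, because every 1-item subset of a pair is frequent by construction.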

18 Significant Bottleneck in the Apriori algorithm is
a) Finding frequent itemsets b) Pruning c) Candidate generation d) Number of iterations
Ans: c

19 Machine learning techniques differ from statistical techniques in that machine learning methods
a) are better able to deal with missing and noisy data
b) typically assume an underlying distribution for the data
c) have trouble with large-sized datasets
d) are not able to explain their behavior.
Ans: a

20 The probability of a hypothesis before the presentation of evidence.
a) a priori b) posterior c) conditional d) subjective
Ans: a

21 KDD represents extraction of
a) data b) knowledge c) rules d) model
Ans: b

22 Which statement about outliers is true?
a) Outliers should be part of the training dataset but should not be present in the test data.
b) Outliers should be identified and removed from a dataset.
c) The nature of the problem determines how outliers are used
d) Outliers should be part of the test dataset but should not be present in the training data.
Ans: c

23 The most general form of distance is
a) Manhattan b) Euclidean c) Mean d) Minkowski
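
For context on question 23: the Minkowski distance of order p reduces to the Manhattan distance at p = 1 and to the Euclidean distance at p = 2. A minimal sketch, with the function name `minkowski` chosen for illustration:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


origin, point = (0, 0), (3, 4)
print(minkowski(origin, point, 1))  # Manhattan: |3| + |4|
print(minkowski(origin, point, 2))  # Euclidean: sqrt(3^2 + 4^2)
```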

24 Which Association Rule would you prefer
a) High support and medium confidence
b) High support and low confidence
c) Low support and high confidence
d) Low support and low confidence
Ans: c

25 In a Rule based classifier, if there is a rule for each combination of attribute values, what do you call that rule set R
a) Exhaustive b) Inclusive c) Comprehensive d) Mutually exclusive
Ans: a

26 The apriori property means
a) If a set cannot pass a test, its supersets will also fail the same test
b) To decrease the efficiency, do level-wise generation of frequent item sets
c) To improve the efficiency, do level-wise generation of frequent item sets
d) If a set can pass a test, its supersets will fail the same test
Ans: a

27 If an item set ‘XYZ’ is a frequent item set, then all subsets of that frequent item set are
a) Undefined b) Not frequent c) Frequent d) Can not say
Ans: c
28 The probability that a person owns a sports car given that they subscribe to automotive magazine is 40%. We also know that 3% of the adult population subscribes to automotive magazine. The probability of a person owning a sports car given that they don’t subscribe to automotive magazine is 30%. Use this information to compute the probability that a person subscribes to automotive magazine given that they own a sports car
a) 0.0368 b) 0.0396 c) 0.0389 d) 0.0398
Ans: b
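
Question 28 is a direct application of Bayes' theorem, with the denominator expanded by the law of total probability: P(M | S) = P(S | M) P(M) / [P(S | M) P(M) + P(S | not M) P(not M)], where M is "subscribes to the magazine" and S is "owns a sports car". A worked sketch with the figures from the question (variable names are illustrative):

```python
p_car_given_sub = 0.40      # P(S | M)
p_sub = 0.03                # P(M)
p_car_given_no_sub = 0.30   # P(S | not M)

# Law of total probability: overall chance of owning a sports car.
p_car = p_car_given_sub * p_sub + p_car_given_no_sub * (1 - p_sub)

# Bayes' theorem: chance of being a subscriber given sports-car ownership.
p_sub_given_car = p_car_given_sub * p_sub / p_car
print(round(p_sub_given_car, 4))
```

The posterior stays small because subscribers make up only 3% of the population, even though they are more likely to own a sports car.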

29 Simple regression assumes a __________ relationship between the input attribute and output attribute.
a) quadratic b) inverse c) linear d) reciprocal
Ans: c

30 To determine association rules from frequent item sets
a) Only minimum confidence needed
b) Neither support nor confidence needed
c) Both minimum support and confidence are needed
d) Minimum support is needed
Ans: c

31 If {A,B,C,D} is a frequent itemset, candidate rules which is not possible is
a) C –> A b) D –> ABCD c) A –> BC d) B –> ADC
Ans: b

32 Which Association Rule would you prefer
a) High support and low confidence
b) Low support and high confidence
c) Low support and low confidence
d) High support and medium confidence
Ans: b

33 Classification rules are extracted from _____________
a) decision tree b) root node c) branches d) siblings
Ans: a

34 What does K refer to in the K-Means algorithm, which is a non-hierarchical clustering approach?
a) Complexity b) Fixed value c) No of iterations d) number of clusters
Ans: d

35 If Linear regression model perfectly fits, i.e., train error is zero, then _____________________
a) Test error is also always zero
b) Test error is non zero
c) Couldn’t comment on Test error
d) Test error is equal to Train error
Ans: c

36 Which of the following metrics can be used for evaluating regression models? i) R Squared ii) Adjusted R Squared iii) F Statistics iv) RMSE/MSE/MAE
a) ii and iv b) i and ii c) ii, iii and iv d) i, ii, iii and iv
Ans: d

37 How many coefficients do you need to estimate in a simple linear regression model (One independent variable)?
a) 1 b) 2 c) 3 d) 4
Ans: b

38 In a simple linear regression model (One independent variable), if we change the input variable by 1 unit, how much will the output variable change?
a) by 1 b) no change c) by intercept d) by its slope
Ans: d

39 In syntax of linear model lm(formula,data,..), data refers to ______
a) Matrix b) array c) vector d) list
Ans: c

40 In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to __________
a) (X-intercept, Slope) b) (Slope, X-Intercept) c) (Y-Intercept, Slope) d) (slope, Y-Intercept)
Ans: c
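
Questions 37, 38 and 40 all concern the two coefficients of Y = β1 + β2X + ϵ. The sketch below estimates them by ordinary least squares in plain Python on made-up data that lies exactly on y = 2 + 3x; the function name `fit_simple_linear_regression` and the data are illustrative assumptions.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = b1 + b2 * x; returns (b1, b2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b2 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b1 = mean_y - b2 * mean_x
    return b1, b2


xs = [0, 1, 2, 3]
ys = [2, 5, 8, 11]  # generated from y = 2 + 3x, with no noise
b1, b2 = fit_simple_linear_regression(xs, ys)

# Raising x by one unit changes the prediction by exactly the slope b2.
change_per_unit = (b1 + b2 * 5) - (b1 + b2 * 4)
print(b1, b2, change_per_unit)
```

Two numbers are estimated, the Y-intercept b1 and the slope b2, and a one-unit change in the input moves the prediction by b2.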
UNIT FOUR    SUB : 410243 DA
Sr. No. Questions a b c d Ans

1 A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
a) Decision tree b) Graphs c) Trees d) Neural Networks
Ans: a
3 What is Decision Tree?
a) Flow-Chart
b) Structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label
c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label
d) None of Above
Ans: c

4 Decision Trees can be used for Classification Tasks.
a) TRUE b) FALSE
Ans: a

5 Choose from the following that are Decision Tree nodes?
a) Decision Nodes b) End Nodes c) Chance Nodes d) All of Above
Ans: d

6 Decision Nodes are represented by ____________
a) Disks b) Squares c) Circles d) Triangles
Ans: b

7 Chance Nodes are represented by __________
a) Disks b) Squares c) Circles d) Triangles
Ans: c

8 End Nodes are represented by __________
a) Disks b) Squares c) Circles d) Triangles
Ans: d

9 Which of the following are the advantage/s of Decision Trees?
a) Possible Scenarios can be added
b) Use a white box model, If given result is provided by a model
c) Worst, best and expected values can be determined for different scenarios
d) All of Above
Ans: d

10 Which of the following statements about Naive Bayes is incorrect?
a) Attributes are equally important.
b) Attributes are statistically dependent of one another given the class value.
c) Attributes are statistically independent of one another given the class value.
d) Attributes can be nominal or numeric
Ans: b

11 Which of the following is not supervised learning?
a) Clustering b) Decision Tree c) Linear Regression d) Naive Bayesian
Ans: a

12 How many terms are required for building a bayes model?
a) 1 b) 2 c) 3 d) 4
Ans: c

13 Where can the Bayes rule be used?
a) Solving queries b) Increasing complexity c) Decreasing complexity d) Answering probabilistic query
Ans: d

16 Bayesian classifiers is
a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory.
b) Any mechanism employed by a learning system to constrain the search space of a hypothesis
c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation.
d) None of these
Ans: a

17 Bias is
a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory
b) Any mechanism employed by a learning system to constrain the search space of a hypothesis
c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation.
d) None of these
Ans: b

18 Background knowledge referred to
a) Additional acquaintance used by a learning algorithm to facilitate the learning process
b) A neural network that makes use of a hidden layer
c) It is a form of automatic learning.
d) None of these
Ans: a

19 Classification accuracy is
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy, of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
Ans: b

20 Classification is
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy, of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
Ans: a

21 Discovery is
a) It is hidden within a database and can only be recovered if one is given certain clues (an example is encrypted information).
b) The process of extracting implicit, previously unknown and potentially useful information from data
c) An extremely complex molecule that occurs in human chromosomes and that carries genetic information in the form of genes.
d) None of these
Ans: b

22 Classification task referred to
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy, of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
Ans: c

23 Euclidean distance measure is A stage of the

KDD process
The process
of ﬁnding a
The distance None of these
between two
c
in which new solution for a points as
data is added problem calculated
to the existing simply by using the
selection. enumerating Pythagoras
all possible theorem
solutions
according to
some
pre-deﬁned
order and
then testing
them

24 The problem of finding hidden structure in unlabeled data is called
a) Supervised learning b) Unsupervised learning c) Reinforcement learning d) None of these
Ans: b
25 Assume you want to perform supervised learning and to predict number of newborns according to size of storks' population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of
a) Classification b) Regression c) Clustering d) Structural equation modeling
Ans: b

26 Discriminating between spam and ham e-mails is a classification task, true or false?
a) TRUE b) FALSE
Ans: a

27 Which of the following is not involved in data mining?
a) Knowledge extraction b) Data archaeology c) Data exploration d) Data transformation
Ans: d

28 Naive prediction is
a) A class of learning algorithms that try to derive a Prolog program from examples
b) A table with n independent attributes can be seen as an n-dimensional space.
c) A prediction made using an extremely simple method, such as always predicting the same output.
d) None of these
Ans: c

29 Node is
a) A component of a network
b) In the context of KDD and data mining, this refers to random errors in a database table.
c) One of the defining aspects of a data warehouse
d) None of these
Ans: a
30 Prediction is
a) The result of the application of a theory or a rule in a specific case
b) One of several possible enters within a database table that is chosen by the designer as the primary means of accessing the data in the table.
c) Discipline in statistics that studies ways to find the most interesting projections of multi-dimensional spaces.
d) None of these
Ans: a

31 What is the relation between the distance between clusters and the corresponding class discriminability?
a) proportional b) inversely-proportional c) no-relation d) None of these
Ans: a

32 The classification method in which the upper limit of one class interval is the same as the lower limit of the next class interval is called….
a) exclusive method b) inclusive method c) mid point method d) None of these
Ans: a

33 larger value is 60 and the smallest value is 40 and the number of classes is 5 then the class interval is
a) 20 b) 25 c) 4 d) 15
Ans: c
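The arithmetic behind this question can be checked with a short sketch. It assumes the usual textbook rule that the class interval (class width) is the range divided by the number of classes; the function name class_interval is illustrative.

```python
def class_interval(largest, smallest, num_classes):
    """Class width = (largest value - smallest value) / number of classes."""
    if num_classes <= 0:
        raise ValueError("number of classes must be positive")
    return (largest - smallest) / num_classes


print(class_interval(60, 40, 5))  # 4.0
```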

34 Summary and presentation of data in tabular form with several non overlapping classes is referred as
a) nominal distribution b) frequency distribution c) ordinal distribution d) None of these
Ans: b

35 The classification method in which the upper and lower limit of interval is also in class interval itself is called….
a) exclusive method b) inclusive method c) mid point method d) None of these
Ans: b
36 Suppose there are 25 base classifiers. Each classifier has error rates of e = 0.35. Suppose you are using averaging as … ensemble of above 25 classifiers will make a wrong prediction? Note: all classifiers are independent of each other
a) 0.05 b) 0.06 c) 0.07 d) 0.08
Ans: b
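One way to reason about this question is sketched below. It assumes, as the standard treatment of this example does, that the ensemble combines its 25 independent classifiers by majority vote, so the ensemble is wrong only when 13 or more base classifiers are wrong at once. That majority-vote reading and the function name are assumptions, not wording from the question.

```python
from math import comb


def majority_vote_error(n_classifiers, error_rate):
    """Probability that more than half of n independent classifiers are wrong."""
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_classifiers, k) * error_rate ** k * (1 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


# Roughly 0.06, far below the 0.35 error rate of each base classifier.
print(majority_vote_error(25, 0.35))
```

The gain only holds while the base classifiers are independent and each is better than random guessing.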
37 The most widely used metrics and tools to assess a classification model are:
a) Confusion matrix b) Cost-sensitive accuracy c) Area under the ROC curve d) All of Above
Ans: d

38 When performing regression or classification, which of the following is the correct way to preprocess the data?
a) Normalize the data → PCA → training
b) PCA → normalize PCA output → training
c) Normalize the data → PCA → normalize PCA output → training
d) None of these
Ans: a
39 Which of the following is true about Naive Bayes?
a) Assumes that all the features in a dataset are equally important
b) Assumes that all the features in a dataset are independent
c) both a and b
d) None of these
Ans: c

40 In which of the following cases will K-means clustering fail to give good results? 1) Data points with outliers 2) Data points with different densities 3) Data points with nonconvex shapes
a) 1 and 2 b) 2 and 3 c) 1, 2, and 3 d) 1 and 3
Ans: c
UNIT FIVE SUB : 410243 DA
Sr. No. Questions a b c d Ans
1 Data visualization is related with…
a) Pictorial representations b) numerical representation c) numerical calculations d) None of these
Ans: a
2 Which of the following are uses of data visualization?
a) See context of data b) Clear data understanding c) finding pattern in data d) all of above
Ans: d
3 Which of the following statements are true about using visualizations to display a dataset?
I. Visualizations are visually appealing, but don't help the viewer understand relationships that exist in the data
II. Visualizations like graphs, charts, or visualizations with pictures are useful for conveying information, while tables just filled with text are not useful.
III. Patterns that exist in the data can be found more easily by using a visualization
a) I AND II b) II AND III c) I AND III d) ONLY III
Ans: d

4 The plot method on Series and DataFrame is just a simple wrapper around ____________
a) gplt.plot() b) plt.plot() c) plt.plotgraph() d) none of the mentioned
Ans: b

5 Point out the correct combination with regards to kind keyword for graph plotting.
a) ‘hist’ for histogram b) ‘box’ for boxplot c) ‘area’ for area plots d) all of the mentioned
Ans: d
6 Which of the following value is provided by kind keyword for barplot?
a) bar b) bar c) bar d) none of the mentioned
Ans: a

7 You can create a scatter plot matrix using the __________ method in pandas.tools.plotting.
a) sca_matrix b) scatter_matrix c) DataFrame.plot d) all of the mentioned
Ans: b

8 Plots may also be adorned with error bars or tables.
a) True b) FALSE c) Cannot Tell d) All Above
Ans: a
9 Which of the following plots are often used for checking randomness in time series?
a) Autocausation b) Autorank c) Autocorrelation d) none of the mentioned
Ans: c

10 __________ plots are used to visually assess the uncertainty of a statistic
a) Lag b) RadViz c) Bootstrap d) All Above
Ans: c
11 Which of the following is not a challenge in Big Data Visualization?
a) Velocity b) Volume c) Version d) Variety
Ans: c
12 Which of the following is not a problem in Big Data Visualization?
a) Visual Noise b) Scaled Data c) Large image perception d) Information Loss
Ans: b

13 Which of the following is a problem in Big Data Visualization?
a) Structured Data b) Scaled Data c) Visual Noise d) Multiple valued Data
Ans: c
14 Which of the candidates is suitable for interactive visualization?
a) Type of Visual b) Cardinality c) Size of data d) all of above
Ans: d
15 Which of the following follows interactive visualization approach?
a) Zoom+Pan b) Focus+Context c) Overview+Details d) all of above
Ans: d

16 Visual Mapping is important for_______
a) Remapping b) Overview+Details c) Focus d) Context
Ans: a
17 Data visualization techniques are:
a) Scatter Plot b) Line Chart c) Pie Chart d) all of above
Ans: d
18 Information Visualization techniques are
a) Flow Chart b) Time Line c) DFD d) All of above
Ans: d
19 Data visualization techniques are:
a) Flow Chart b) Time Line c) Pie Chart d) None of these
Ans: c
20 Information Visualization techniques are
a) Flow Chart b) Line Chart c) Pie Chart d) None of these
Ans: a
21 Data visualization techniques are:
a) Scatter Plot b) Time Line c) DFD d) None of these
Ans: a
22 Information Visualization techniques are
a) Scatter Plot b) Time Line c) Bubble Chart d) None of these
Ans: b
23 Data visualization techniques are:
a) Histogram b) Parallel Coordinates c) Time Line d) None of these
Ans: a
24 Information Visualization techniques are
a) Semantic Network b) Histogram c) Area Chart d) None of these
Ans: a
25 Which of the following is a related term with correlation?
a) Exponential b) U-Shape c) Null d) All of above
Ans: d
26 Data visualization techniques are:
a) Scatter Plot b) Time Line c) DFD d) None of these
Ans: a
27 Column graph is another name for _____
a) Bar Chart b) Scatterplot c) Histogram d) Area Chart
Ans: a
28 Which of the following follows interactive visualization approach?
a) Zoom+Pan b) Focus+Context c) Overview+Details d) all of above
Ans: d

29 Information Visualization techniques are
a) Pie Chart b) Scatterplot c) Histogram d) Area Chart
Ans: a
30 Which of the following is category of timeline?
a) Linear Timeline b) Modular Timeline c) Variant Timeline d) ER Timeline
Ans: a

31 Which of the following specifies relationship amongst variables?
a) Scatter Plot b) Line Chart c) Area Chart d) All of above
Ans: d
32 Which of the following specifies category Proportions?
a) Pie Chart b) Histogram c) Bar chart d) All of above
Ans: d
33 Which of the following is category of timeline?
a) Variant Timeline b) ER Timeline c) Comparative Timeline d) Modular Timeline
Ans: c

34 Information Visualization techniques are
a) Flow Chart b) Time Line c) DFD d) All of above
Ans: d
35 Data visualization techniques are:
a) Flow Chart b) Time Line c) Pie Chart d) None of these
Ans: c
36 Data visualization is related with…
a) Pictorial representations b) numerical representation c) numerical calculations d) None of these
Ans: a
37 Which of the following follows interactive visualization approach?
a) Zoom+Pan b) Focus+Context c) Overview+Details d) all of above
Ans: d

38 Which of the following are uses of data visualization?
a) See context of data b) Clear data understanding c) finding pattern in data d) all of above
Ans: d
39 Which of the following specifies relationship amongst variables?
a) Pie Chart b) Histogram c) Area Chart d) None of these
Ans: c

40 Which of the following specifies category Proportions?
a) Pie Chart b) Scatter Plot c) Line Chart d) None of these
Ans: a
UNIT SIX SUB : 410243 DA
Sr. No. Questions a b c d Ans

1 Precise and steady format data is____
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: a
2 Inconsistent Data is______
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: b
3 Format that self defines itself is________
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: c
4 A little bit inconsistent data is_______
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: d
5 XML is an example of_______
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
6 RDBMS follows__________
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: a
7 Watson is developed by____
a) IBM b) Microsoft c) AT&T d) Google
Ans: a
8 Hadoop is _____ based Framework.
a) C++ b) Python c) JAVA d) C#
Ans: c
9 Which of the following are components of Hadoop?
a) MAPREDUCE b) YARN c) HDFS d) All of Above
Ans: d
10 Which of the following are components of HIVE?
a) JDBC b) Thrift Server c) CLI d) All of Above
Ans: d
11 Mahout provides__________
a) JAVA Executable Libraries b) C# Executables c) Mountable Image Format d) All of Above
Ans: a

12 Which of the following are components of HIVE?
a) FLATTEN b) Thrift Server c) Muster d) None of these
Ans: b
13 Which of the following are components of HIVE?
a) FLATTEN b) Thrift Server c) Muster d) All of above
Ans: b
14 Which of the following is a component of Hadoop?
a) Fork b) YARN c) CLI d) Metadata
Ans: b
15 RDBMS follows__________
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: a

16 Which of the following is a clustering technique?
a) Fuzzy K means b) Canopy c) K-Means d) All of above
Ans: d
17 Which of the following is HBASE Data Model Terminology?
a) Row b) Table c) Column d) All of Above
Ans: d
18 Which of the following is not a classification technique?
a) Logistic Regression b) Random Forest c) Recommender Algo d) Naïve Bayes
Ans: c

19 Which of the following is a classification technique?
a) Logistic Regression b) Random Forest c) Naïve Bayes d) All of Above
Ans: d

20 Which of the following is HBASE Data Model Terminology?
a) Column Family b) Cell c) Timestamp d) All of Above
Ans: d
21 Which of the following is a clustering technique?
a) Logistic Regression b) Random Forest c) K-Means d) Naïve Bayes
Ans: c

22 Which of the following is HBASE Data Model Terminology?
a) Identifier b) Variant c) Timestamp d) None of the above
Ans: c
23 Which of the following is not a classification technique?
a) Logistic Regression b) Random Forest c) K-Means d) Naïve Bayes
Ans: c

24 Which of the following are components of HIVE?
a) FLATTEN b) Thrift Server c) Muster d) None of these
Ans: b
25 Which of the following is HBASE Data Model Terminology?
a) Identifier b) Variant c) Column Qualifier d) None of the above
Ans: c
26 Mahout provides__________
a) JAVA Executable Libraries b) C# Executables c) Mountable Image Format d) None of the above
Ans: a

27 Which of the following is not a clustering technique?
a) Logistic Regression b) Canopy c) K-Means d) Fuzzy K means
Ans: a

28 Which of the following is a clustering technique?
a) Fuzzy K means b) Canopy c) K-Means d) All of above
Ans: d
29 Point out the correct statement.
a) Hadoop does need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real-time data
c) In Hadoop programming framework output files are divided into lines or records
d) None of the above
Ans: b
30 What was Hadoop named after?
a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop development
Ans: c
31 ___________ programming model used to develop Hadoop-based applications that can process massive amounts of data.
a) MapReduce b) Mahout c) Oozie d) None of the above
Ans: a

32 Which of the following is not a classification technique?
a) Logistic Regression b) Random Forest c) K-Means d) Naïve Bayes
Ans: c
33 Which of the following are components of HIVE?
a) FLATTEN b) Thrift Server c) Muster d) All of above
Ans: b

34 Which of the following is a component of Hadoop?
a) Fork b) YARN c) CLI d) None of above
Ans: b

35 Hadoop is a framework that works with a variety of related tools. Common cohorts include ____________
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) All of above
Ans: a

36 NoSQL databases are used mainly for handling large volumes of ______________ data.
a) Structured Data b) Unstructured Data c) Semi-Structured Data d) Quasi-Structured Data
Ans: b

37 Which of the following is not a phase of Data Analytics Life Cycle?
a) Communication b) Recall c) Data Preparation d) Model Planning
Ans: b

38 Which of the following is a NoSQL Database Type?
a) SQL b) Document databases c) JSON d) All of above
Ans: b
39 Which of the following is not a NoSQL database
a) SQL Server b) MongoDB c) Cassandra d) None of the above
Ans: a
1. Data Analysis is a process of?
A. inspecting data
B. cleaning data
C. transforming data
D. All of the above
Ans : D

Explanation: Data Analysis is a process of inspecting, cleaning, transforming and modeling data with the goal of discovering useful information, suggesting conclusions and supporting decision-making.

2. Which of the following is not a major data analysis approach?
A. Data Mining
B. Predictive Intelligence
C. Business Intelligence
D. Text Analytics
Ans : B

Explanation: Predictive Analytics, not Predictive Intelligence, is a major data analysis approach.

3. How many main statistical methodologies are used in data analysis?
A. 2
B. 3
C. 4
D. 5
Ans : A

Explanation: In data analysis, two main statistical methodologies are used: descriptive statistics and inferential statistics.

4. In descriptive statistics, data from the entire population or a sample is summarized with ?
A. integer descriptors
B. floating descriptors
C. numerical descriptors
D. decimal descriptors
Ans : C

Explanation: In descriptive statistics, data from the entire population or a sample is summarized with numerical descriptors.

5. Data Analysis is defined by the statistician?
A. William S.
B. Hans Peter Luhn
C. Gregory Piatetsky-Shapiro
D. John Tukey
Ans : D

Explanation: Data Analysis is defined by the statistician John Tukey in 1961 as "Procedures for analyzing data ..."

6. Which of the following is true about hypothesis testing?
A. answering yes/no questions about the data
B. estimating numerical characteristics of the data
C. describing associations within the data
D. modeling relationships within the data
Ans : A

Explanation: answering yes/no questions about the data (hypothesis testing)

7. The goal of business intelligence is to allow easy interpretation of large volumes of data to identify new opportunities.
A. TRUE
B. FALSE
C. Can be true or false
D. Can not say
Ans : A

Explanation: The goal of business intelligence is to allow easy interpretation of large volumes of data to identify new opportunities.

8. The branch of statistics which deals with development of particular statistical methods is classified as
A. industry statistics
B. economic statistics
C. applied statistics
D. applied statistics
Ans : D

Explanation: The branch of statistics which deals with development of particular statistical methods is classified as applied statistics.

9. Which of the following is true about regression analysis?
A. answering yes/no questions about the data
B. estimating numerical characteristics of the data
C. modeling relationships within the data
D. describing associations within the data
Ans : C

Explanation: modeling relationships within the data (E.g. regression analysis).

10. Text Analytics, also referred to as Text Mining?
A. TRUE
B. FALSE
C. Can be true or false
D. Can not say
Ans : A

Explanation: Text Data Mining is the process of deriving high-quality information from text.
SUB : 410243 DA

Sr. No. Objective Questions (MCQ /True or False / Fill up with Choices )
1. Which of the following is not an example of Social Media?
a. Twitter
b. Google
c. Insta
d. Youtube
2. By 2025, the volume of digital data will increase to
a. TB
b. YB
c. ZB
d. EB
3. For drawing insights for business, what is needed?
a. Collecting the data
b. Storing the data
c. Analysing the data
d. All the above
4. Does Facebook use "Big Data" to perform the concept of Flashback? Is this True or False.
a. TRUE
b. FALSE
5. The process of describing the data that is huge and complex to store and process is known as
a. Analytics
b. Data mining
c. Big Data
d. Data Warehouse
6. Data generated from online transactions is one of the examples for volume of big data. Is this True or False.
a. TRUE
b. FALSE
7. Velocity is the speed at which the data is processed
a. TRUE
b. FALSE
8. ________ have a structure but cannot be stored in a database.
a. Structured
b. Semi-Structured
c. Unstructured
d. None of these
9. ________ refers to the ability to turn your data useful for business.
a. Velocity
b. Variety
c. Value
d. Volume
10. Value tells the trustworthiness of data in terms of quality and accuracy.
a. TRUE
b. FALSE
11. GFS consists of a ________ Master and ________ Chunk Servers
a. Single, Single
b. Multiple, Single
c. Single, Multiple
d. Multiple, Multiple
12. Files are divided into ________ sized Chunks.
a. Static
b. Dynamic
c. Fixed
d. Variable
13. ________ is an open source framework for storing data and running application on clusters of commodity hardware.
a. HDFS
b. Hadoop
c. MapReduce
d. Cloud
14. HDFS Stores how much data in each clusters that can be scaled at any time?
a. 32
b. 64
c. 128
d. 256
15. Hadoop MapReduce allows you to perform distributed parallel processing on large volumes of data quickly and efficiently… is this MapReduce or Hadoop… i.e statement is True or False
a. TRUE
b. FALSE
16. Hortonworks was introduced by Cloudera and owned by Yahoo.
a. TRUE
b. FALSE
17. Hadoop YARN is used for Cluster Resource Management in Hadoop Ecosystem.
a. TRUE
b. FALSE
18. Google Introduced MapReduce Programming model in 2004.
a. TRUE
b. FALSE
19. ________ phase sorts the data & creates logical clusters.
a. Reduce, YARN
b. MAP, YARN
c. REDUCE, MAP
d. MAP, REDUCE
20. There is only one operation between Mapping and Reducing is it True or False…
a. TRUE
b. FALSE
21. ________ is factors considered before Adopting Big Data Technology.
a. Validation
b. Verification
c. Data
d. Design
22. ________ for improving supply chain management to optimize stock management, replenishment, and forecasting;
a. Descriptive
b. Diagnostic
c. Predictive
d. Prescriptive
23. Which among the following is not a Data mining and analytical applications?
a. profile matching
b. social network analysis
c. facial recognition
d. Filtering
24. ________ as a result of data accessibility, data latency, data availability, or limits on bandwidth in relation to the size of inputs.
a. Computation-restricted throttling
b. Large data volumes
c. Data throttling
d. Benefits from data parallelization
25. As an example, an expectation of using a recommendation engine would be to increase same-customer sales by adding more items into the market basket.
a. Lowering costs
b. Increasing revenues
c. Increasing productivity
d. Reducing risk
26. Which storage subsystem can support massive data volumes of increasing size.
a. Extensibility
b. Fault tolerance
c. Scalability
d. High-speed I/O capacity
27. ________ provides performance through distribution of data and fault tolerance through replication
a. HDFS
b. PIG
c. HIVE
d. HADOOP
28. ________ is a programming model for writing applications that can process Big Data in parallel on multiple nodes.
a. HDFS
b. MAP REDUCE
c. HADOOP
d. HIVE

29. ________ takes the grouped key-value paired data as input and runs a Reducer function on each one of them.
a. MAPPER
b. REDUCER
c. COMBINER
d. PARTITIONER

30. ________ is a type of local Reducer that groups similar data from the map phase into identifiable sets.
a. MAPPER
b. REDUCER
c. COMBINER
d. PARTITIONER
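To make the mapper, combiner and reducer roles in the questions above more concrete, here is a minimal single-process word-count sketch in plain Python. It only imitates the data flow of the MapReduce model and does not use the Hadoop API; every function name in it is illustrative.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Local reducer: pre-sum the counts produced by a single mapper."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, values):
    """Reduce phase: collapse the grouped values of one key into a total."""
    return key, sum(values)


def word_count(splits):
    """Run map and combine per input split, then shuffle and reduce."""
    combined = []
    for split in splits:  # each split stands in for the input of one mapper
        pairs = [pair for line in split for pair in mapper(line)]
        combined.extend(combiner(pairs))
    return dict(reducer(key, values) for key, values in shuffle(combined).items())


result = word_count([["big data big"], ["data node"]])
print(result)  # {'big': 2, 'data': 2, 'node': 1}
```

The combiner is optional in this flow: removing it changes only how many pairs travel through the shuffle step, not the final counts.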

31. While installing Hadoop, how many XML files are edited? List them.
i. core-site.xml
ii. hdfs-site.xml
iii. mapred-site.xml
iv. yarn-site.xml

32. Write the code for core-site.xml ?

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <!-- Base directory for Hadoop's temporary files -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>D:\hadoop\temp</value>
  </property>
  <!-- Default file system URI; newer Hadoop releases name this key fs.defaultFS -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:50071</value>
  </property>
</configuration>

33. Write the code for hdfs-site.xml ?
SUB : 410243 DA

Sr. No. Objective Questions (MCQ /True or False / Fill up with Choices )
1. Movie Recommendation systems are an example of
1. Classification 2. Clustering 3. Reinforcement Learning 4. Regression
a. 2 Only
b. 1 and 2
c. 1 and 3
d. 2 and 3
2. Sentiment Analysis is an example of
1. Regression 2. Classification 3. Clustering 4. Reinforcement Learning
a. 1, 2 and 4
b. 1 and 3
c. 1, 2 and 3
d. 1 and 2
3. Can decision trees be used for performing clustering?
a. True
b. False
4. What is the minimum no. of variables/ features required to perform clustering?
1. 0
2. 1
3. 2
4. 3
5. For two runs of K-Mean clustering is it expected to get same clustering results?
1. Yes
2. No
6. Which of the following can act as possible termination conditions in K-Means?
1. For a fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum.
3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.
a. 1, 3 and 4
b. 1, 2 and 3
c. 1, 2 and 4
d. All of the above
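The four termination conditions listed in this question can be seen side by side in a small sketch. This is a deliberately tiny one-dimensional K-Means written for illustration, not a reference implementation; the function name, the default values and the returned reason strings are all assumptions.

```python
def kmeans_1d(points, centroids, max_iter=100, rss_threshold=0.0):
    """Tiny 1-D K-Means that also reports which stopping rule fired."""
    assignment = None
    reason = "max_iter"  # condition 1: a fixed number of iterations
    for _ in range(max_iter):
        # Assign every point to the index of its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members (keep it if empty).
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        # Residual sum of squares of the current clustering.
        rss = sum((p - new_centroids[a]) ** 2 for p, a in zip(points, new_assignment))

        if new_assignment == assignment:
            reason = "assignments unchanged"  # condition 2
        elif new_centroids == centroids:
            reason = "centroids unchanged"    # condition 3
        elif rss < rss_threshold:
            reason = "RSS below threshold"    # condition 4
        assignment, centroids = new_assignment, new_centroids
        if reason != "max_iter":
            break
    return centroids, assignment, reason


print(kmeans_1d([1, 2, 3, 10, 11, 12], [1.0, 12.0]))
# ([2.0, 11.0], [0, 0, 0, 1, 1, 1], 'assignments unchanged')
```

Different starting centroids can lead to a different final clustering, which is why two runs of K-Means are not guaranteed to agree.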
7. Which of the following algorithm is most sensitive to outliers?
1. K-means clustering algorithm
2. K-medians clustering algorithm
3. K-modes clustering algorithm
4. K-medoids clustering algorithm
8. After performing K-Means Clustering analysis on a dataset, you observed the following dendrogram. Which of the following conclusion can be drawn from the dendrogram?
a. There were 28 data points in clustering analysis
b. The best no. of clusters for the analyzed data points is 4
c. The proximity function used is Average-link clustering
d. The above dendrogram interpretation is not possible for K-Means clustering analysis
9. In the figure below, if you draw a horizontal line on y-axis for y=2. What will be the number of clusters formed?
1. 1
2. 2
3. 3
4. 4
10. In which of the following cases will K-Means clustering fail to give good results?
1. Data points with outliers
2. Data points with different densities
3. Data points with round shapes
4. Data points with non-convex shapes
a. 1 and 2
b. 2 and 3
c. 2 and 4
d. 1, 2 and 4
11. The discrete variables and continuous variables are two types of
a. Open end classification
b. Time series classification
c. Qualitative classification
d. Quantitative classification
12. Bayesian classifiers is
1. A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory.
2. Any mechanism employed by a learning system to constrain the search space of a hypothesis
3. An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation.
4. None of these
13. Classification accuracy is
1. A subdivision of a set of examples into a number of classes
2. Measure of the accuracy, of the classification of a concept that is given by a certain theory
3. The task of assigning a classification to a set of examples
4. None of these
14. Classification task referred to
1. A subdivision of a set of examples into a number of classes
2. A measure of the accuracy, of the classification of a concept that is given by a certain theory
3. The task of assigning a classification to a set of examples
4. None of these
15. Euclidean distance measure is
1. A stage of the KDD process in which new data is added to the existing selection.
2. The process of finding a solution for a problem simply by enumerating all possible solutions according to some pre-defined order and then testing them
3. The distance between two points as calculated using the Pythagoras theorem
4. None of these
16. ________ is good at handling missing data and supports both kinds of attributes (i.e. Categorical and Continuous attributes)
a. ID3.
b. C4.5.
c. CART.
d. Naïve Bayes.
17. Decision trees use ________, in that they always choose the option that seems the best available at that moment.
a. Greedy Algorithms.
b. Divide and Conquer.
c. Backtracking.
d. Shortest Path Method.
18. Decision trees cannot handle categorical attributes with many distinct values, such as country codes for telephone numbers.
a. TRUE
b. FALSE
19. ________ are easy to implement and can execute efficiently even without prior knowledge of the data, they are among the most popular algorithms for classifying text documents.
a. ID3
b. Naïve Bayes classifiers
c. CART
d. None of these.
20. High entropy means that the partitions in classification are
a. Pure
b. Not pure
c. Useful
d. Useless
21. Which of the following statements about Naive Bayes is incorrect?
a. Attributes are equally important.
b. Attributes are statistically dependent of one another given the class value.
c. Attributes are statistically independent of one another given the class value.
d. Attributes can be nominal or numeric
22. The maximum value for entropy depends on the number of classes so if we have 8 Classes what will be the max entropy.
a. Max Entropy is 1
b. Max Entropy is 2
c. Max Entropy is 3
d. Max Entropy is 4
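The dependence of maximum entropy on the number of classes can be checked numerically. The sketch below assumes Shannon entropy measured in bits (base-2 logarithm), which is the usual convention for decision trees; entropy is highest when all classes are equally likely, giving log2(k) for k classes. The function names are illustrative.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def max_entropy(num_classes):
    """Entropy of the uniform distribution over num_classes classes."""
    return math.log2(num_classes)


print(max_entropy(8))        # 3.0
print(entropy([1 / 8] * 8))  # 3.0, the uniform case reaches the maximum
print(entropy([1.0]))        # a pure partition carries no uncertainty
```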
23. John flies frequently and likes to upgrade his seat to first class. He has determined that if he checks in for his flight at least two hours early, the probability that he will get an upgrade is 0.75; otherwise, the probability that he will get an upgrade is 0.35. With his busy schedule, he checks in at least two hours before his flight only 40% of the time. Suppose John did not receive an upgrade on his most recent attempt. What is the probability that he did not arrive two hours early?
a. 0.892
b. 0.796
c. 0.685
d. 0.999
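This question is a direct application of Bayes' theorem, and the sketch below works through it with the numbers given in the question. The function and parameter names are illustrative.

```python
def p_late_given_no_upgrade(p_early, p_upgrade_if_early, p_upgrade_if_late):
    """P(not early | no upgrade) by Bayes' theorem."""
    p_late = 1 - p_early
    p_no_upgrade_if_early = 1 - p_upgrade_if_early
    p_no_upgrade_if_late = 1 - p_upgrade_if_late
    # Total probability of not getting an upgrade.
    p_no_upgrade = p_no_upgrade_if_early * p_early + p_no_upgrade_if_late * p_late
    return p_no_upgrade_if_late * p_late / p_no_upgrade


# (0.65 * 0.60) / (0.25 * 0.40 + 0.65 * 0.60) = 0.39 / 0.49
print(round(p_late_given_no_upgrade(0.40, 0.75, 0.35), 3))  # 0.796
```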
24. Point out the wrong statement.
a. k-nearest neighbor is same as k-means
b. k-means clustering is a method of vector quantization
c. k-means clustering aims to partition n observations into k clusters
d. none of the mentioned
25. Consider the following example “How we can divide set of articles such that those articles have the same theme (we do not know the theme of the articles ahead of time)”, is this:
1. Clustering
2. Classification
3. Regression
4. None of These
26. Can we use K Mean Clustering to identify the objects in video?
1. Yes
2. No
27. Clustering techniques are ________ in the sense that the data scientist does not determine, in advance, the labels to apply to the clusters.
1. Unsupervised
2. Supervised
3. Reinforcement
4. Neural network

Sr. No. Objective Questions (MCQ /True or False / Fill up with Choices )
1. ________ metric is examined to determine a reasonably optimal value of k.
1. Mean Square Error
2. Within Sum of Squares (WSS)
3. Speed
4. None of These
2. If an itemset is considered frequent, then any subset of the frequent itemset must also be frequent.
1. Apriori Property
2. Downward Closure Property
3. Either 1 or 2
4. Both 1 & 2
3. If {bread,eggs,milk} has a support of 0.15 and {bread,eggs} also has a support of 0.15, the confidence of rule {bread,eggs}→{milk} is
1. 0
2. 1
3. 2
4. 3
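Support and confidence, as used in the rule above, can be computed from a list of transactions. In the sketch below the transaction list is made up purely so that the two supports match the 0.15 stated in the question; the function names are illustrative.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """confidence(X -> Y) = support(X and Y together) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical data: 3 of 20 baskets hold bread, eggs and milk; no other basket has both bread and eggs.
transactions = [{"bread", "eggs", "milk"}] * 3 + [{"bread"}] * 17

print(support({"bread", "eggs", "milk"}, transactions))       # 0.15
print(support({"bread", "eggs"}, transactions))               # 0.15
print(confidence({"bread", "eggs"}, {"milk"}, transactions))  # 1.0
```

A confidence of 1 here means every basket with bread and eggs also had milk; it says nothing about how common milk is overall, which is the gap that lift is meant to fill.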
4. Confidence is a measure of how X and Y are really related rather than coincidentally happening together.
a. True
b. False
5. A high-confidence rule can sometimes be misleading because confidence does not consider support of the itemset in the rule consequent. Is This True ?
a. Yes
b. No
6. ________ recommend items based on similarity measures between users and/or items.
1. Content Based Systems
2. Hybrid System
3. Collaborative Filtering Systems
4. None of These
7. There are ________ major Classification of Collaborative Filtering Mechanisms
1. 1
2. 2
3. 3
4. None of These
8. Movie Recommendation to peoples is an example of
1. User Based Recommendation
2. Item Based Recommendation
3. Knowledge Based Recommendation
4. Content Based Recommendation
9. ________ recommenders rely on an explicitly defined set of recommendation rules.
1. Constraint Based
2. Case Based
3. Content Based
4. User Based
10. Parallelized hybrid recommender systems operate dependently of one another and produce separate recommendation lists.
1. True
2. False
11. Association rules are sometimes referred to as
a. market basket analysis
b. Itemset Filtering
c. Frequent Itemset Analysis
d. None of these.
12. If 80% of all transactions contain itemset {bread}, then the support of {bread} is 0.8. Similarly, if 60% of all transactions contain itemset {bread,butter}, then the support of {bread,butter} is
a. 0.4
b. 0.5
c. 0.6
d. 0.7
13. Lift is defined as the measure of certainty or trustworthiness associated with each discovered rule.
a. TRUE
b. FALSE
14. ________ is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental.
a. Lift
b. Confidence
c. Support
d. Leverage
15. ________ recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users.
a. Collaborative Filtering System
b. Content Based Recommendation
c. Knowledge Based Recommendation
d. Hybrid Approaches
16. Pure collaborative approaches take a matrix of given user–item ratings as the only input and typically produce output. Is it Pure Collaborative?
a. Yes
b. No
17. With respect to the determination of the set of similar users, one common measure used in recommender systems is
a. Cosine Similarity Measure
b. Pearson’s correlation coefficient.
c. Mean Squared Error Method
d. None of these.
18. Large-scale e-commerce sites, often implement a different technique, ________, which is more apt for offline preprocessing and thus allows for the computation of recommendations in real time even for a very large rating matrix.
a. Item-Based Recommendation
b. User-Based Recommendation
c. Content-Based Recommendation
d. None of these
19. Here are two very short texts to compare and find the cosine similarity measure?
I. Julie loves me more than Linda loves me
II. Jane likes me more than Julie loves me
a. 0.6
b. 0.7
c. 0.8
d. 0.9
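The cosine similarity of the two short texts can be worked out by treating each text as a vector of word counts. The sketch below assumes a simple bag-of-words model with lowercase, whitespace-separated tokens and no stop-word removal or weighting; the function name is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # A Counter returns 0 for words it has not seen, so absent words add nothing.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(
    "Julie loves me more than Linda loves me",
    "Jane likes me more than Julie loves me",
)
# Dot product 9, vector norms sqrt(12) and sqrt(10): 9 / sqrt(120)
print(round(similarity, 3))  # 0.822
```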
20. ________ is based on the availability of item descriptions and a profile that assigns importance to these characteristics.
a. Item-Based Recommendation
b. User-Based Recommendation
c. Content-Based Recommendation.
d. None of these
21. Consider the features of a movie which are not relevant to a recommendation system.
a. The set of actors of the movie.
b. The Director
c. The Year in which the movie was made
d. The Budget of the movie.
22. A ________ has been implemented, for similarity based retrieval under nearest neighbors.
a. k-nearest-neighbor method (kNN)
b. Conventional Neural Network (CNN)
c. Bayes Theorem
d. Naïve Bayes Classifier
23. Case-based recommenders focus on the retrieval of similar items on the basis of different types of similarity measures
a. TRUE
b. FALSE
24. In ________ recommendation approaches, items are retrieved using similarity measures that describe to which extent item properties match some given user’s requirements.
a. Item-Based
b. Case-Based
c. Content-Based
d. User-Based
25. ________ are based on a sequenced order of techniques, in which each succeeding recommender only refines the recommendations of its predecessor.
a. Weighted Hybrids
b. Mixed Hybrids
c. Cascade Hybrids
d. Switching Hybrids
26. ________ require an oracle that decides which recommender should be used in a specific situation, depending on the user profile and/or the quality of recommendation
a. Weighted Hybrids
b. Mixed Hybrids
c. Cascade Hybrids
d. Switching Hybrids
SUB : 410243 DA

According to analysts, for what can traditional IT systems provide a
foundation when they’re integrated with big data technologies like
Hadoop?
(A) Big data management and data mining
(B) Data warehousing and business intelligence
(C) Management of Hadoop clusters
(D) Collecting and storing unstructured data

Answer
A

MCQ No - 2
What are the main components of Big Data?
(A) MapReduce
(B) HDFS
(C) YARN
(D) All of these

Answer
D

MCQ No - 3
What are the different features of Big Data Analytics?
(A) Open-Source
(B) Scalability
(C) Data Recovery
(D) All the above

Answer
D

MCQ No - 4
According to analysts, for what can traditional IT systems provide a
foundation when they’re integrated with big data technologies like
Hadoop?
(A) Big data management and data mining
(B) Data warehousing and business intelligence
(C) Management of Hadoop clusters
(D) Collecting and storing unstructured data

Answer
A

MCQ No - 5
What are the four V’s of Big Data?
(A) Volume
(B) Velocity
(C) Variety
(D) All the above

Answer
D

MCQ No - 6

All of the following accurately describe Hadoop, EXCEPT:

(A) Open-source

(B) Real-time

(D) Distributed computing approach

Answer
B

MCQ No - 7

___________ is general-purpose computing model and runtime system

for distributed data analytics.
(A) Mapreduce

(B) Drill

(D) None of the above

Answer
A

MCQ No - 8

The examination of large amounts of data to see what patterns or other

useful information can be found is known as
(A) Data examination

(B) Information analysis

(C) Big data analytics

(D) Data analysis

Answer
C

MCQ No - 9

Big data analysis does the following except

(A) Collects data

(B) Spreads data

(C) Organizes data

(D) Analyzes data

Answer
B

MCQ No - 10

What makes Big Data analysis difficult to optimize?

(A) Big Data is not difficult to optimize

(B) Both data and cost effective ways to mine data to make business sense out of it

(C) The technology to mine data

(D) All of the above

Answer
B

MCQ No - 11

The new source of big data that will trigger a Big Data revolution in the
years to come is
(A) Business transactions

(B) Social media

(C) Transactional data and sensor data

(D) RDBMS

Answer
C

MCQ No - 12

The unit of data that flows through a Flume agent is

(A) Log

(B) Row

Answer
C

MCQ No - 13

Listed below are the three steps that are followed to deploy a Big Data
Solution except
(A) Data Ingestion

(B) Data Processing

(C) Data dissemination

(D) Data Storage

Answer
C

MCQ No - 14

Check below the best answer to "which industries employ the use of so-
called "Big Data" in their day to day operations?
(A) Weather forecasting

(B) Marketing

(D) All of the above

Answer
D

MCQ No - 15

There are almost as many bits of information in the digital universe as

there are stars in the actual universe?
(A) True

(B) False

Answer
A

MCQ No - 16

The word 'Big data' was coined by

(A) Roger Mougalas

(B) John Philips

(C) Simon Woods

(D) Martin Green

Answer
A

MCQ No - 17

The word 'Big Data' was coined in the year

(A) 2000

(B) 1970

(D) 2005

Answer
C

MCQ No - 18

Concerning the Forms of Big Data, which one of these is odd?

(A) Structured

(B) Unstructured

(D) Semi-Structured

Answer
C

MCQ No - 19

Big Data applications benefit the media and entertainment industry by

(A) Predicting what the audience wants

(B) Ad targeting
(C) Scheduling optimization

(D) All of the above

Answer
D

MCQ No - 20

The feature of big data that refers to the quality of the stored data is
______
(A) Variety

(B) Volume

(D) Veracity

Answer
D
ZEAL EDUCATION SOCIETY’S
ZEAL COLLEGE OF ENGINEERING AND RESEARCH
NARHE │PUNE -41 │ INDIA
DEPARTMENT OF COMPUTER ENGINEERING

Name of the Teacher: Ms. P. S. Patil

Class: BE Subject: Data Analytics

AY: 2020-21 SEM: II

UNIT-1
1) What is Big Data?
a) Huge amount of data
b) Small amount of data
c) Huge File
d) Big Storage
Ans: a
Explanation: It is Huge amount of data
2) According to analysts, for what can traditional IT systems provide a
foundation when they’re integrated with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
Ans: a
Explanation: Big data management and data mining
3) What are the main components of Big Data?
a)MapReduce
b)HDFS
c)YARN
d)All of these
Ans: d
Explanation: All of these
4) The sources of Big Data are
a)Stock Exchange
b)Transport Data
c) Banking Data
d) All of the Above
Ans: d
Explanation:
5) Big Data Characteristics are:
a) Structured data
b) Semi-structured data
c) Quasi-structured data
d) All of the above
Ans: d
Explanation:
6) BI tends to provide reports, dashboards, and queries on business
questions for the current period or in the past.

a) True
b) False
Ans: a
Explanation:
7) Big data can come in multiple forms, including structured and non-structured
data
a) True
b) False
Ans: a
Explanation:
8) BI problems tend to require highly structured data organized
a) Rows
b) Columns
c) Accurate Reporting
d) All of the Above
Ans: d
Explanation:
9) EDW achieves the objective of reporting and sometimes the creation of
dashboards, perform analysis on unstructured data
a) High-value data is hard to reach and leverage
b) Data moves in batches from EDW to local analytical tools
c) Data Science projects will remain isolated
d) All of the Above
Ans: d
Explanation:
10) Drivers of Big Data
a) Medical information
b) Photos and video footage uploaded to the World Wide Web
c) data extracts
d) Both a and b
Ans: d
Explanation:
11) According to analysts, for what can traditional IT systems provide a
foundation when they’re integrated with big data technologies like Hadoop?

a) Big data management and data mining

b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data

Ans: a
Explanation:
12) Select from option which is not the phase of data analytics
a) model planning
b) testing
c) discovery
d) operationalize

Ans: b

Explanation:
13) Which phase of data analytics require more time to complete

a) Data preparation
b) model building
c) communicate results
d) Discovery

Ans: a
Explanation:
14) What is analytic sandbox?

a) Tool
b) Separate repository
c) data cleaning
d) Data conditioning

Ans: b
Explanation:
15) The person who provides analytic techniques and modeling is called a:

a) Data Engineer
b) Data scientist
c) Business user
d) Project manager

Ans: b
Explanation:
16) What is task of Project manager?

a) analytic modelling
b) Provide requirement
c) ensure meeting objectives
d) creates DB environment

Ans: c
Explanation:
17) Identifying Key Stakeholders this task is performed in which phase?

a) Data preparation
b) model building
c) Discovery
d) communicate results

Ans: c
Explanation:
18) ETL process is performed in which phase

a) Discovery
b) communicate results
c) model planning
d) Data preparation

Ans: d
Explanation:
19) How much data do data science teams prefer for analysis?

a) too little
b) average
c) more
d) more than average

Ans: c
Explanation:
20) Select the tool that is not used in the model planning phase.

a) Data Wrangler
b) R
c) SQL Analysis Services
d) SAS/ACCESS

Ans: a
Explanation:
21) if reports and dashboards will be impacted and need to change this task is
performed by.

a) Project sponsor
b) BI Analyst
c) Data Engineer
d) Project manager

Ans: b
Explanation:
22) What is need of data analytic lifecycle.

a) Data cleaning
b) To solve Big data problems
c) Data conditioning
d) Data Exploration

Ans: b
Explanation:
23) How many phases are there in data analytic lifecycle?

a) 4
b) 5
c) 6
d) 7
Ans: c
24) The person with technical skills is called as?

a) Business user
b) Data Engineer
c) Data scientist
d) Project sponsor
Ans: b
25) What is outcome of Model building phase?

a) Analytic results
b) Quality data
c) Data
d) Potential resources
Ans: a

Pravin S.Patil
Subject Teacher
ZEAL EDUCATION SOCIETY’S
ZEAL COLLEGE OF ENGINEERING AND RESEARCH
NARHE │PUNE -41 │ INDIA
DEPARTMENT OF COMPUTER ENGINEERING

Name of the Teacher: Ms. P. S. Patil

Class: BE Subject: Data Analytics

AY: 2020-21 SEM: II

UNIT-II
1) A statement made about a population for testing purposes is called?

a) Statistic
b) Hypothesis
c) Level of Significance
d) Test-Statistic
Ans: b
Explanation:
2) A hypothesis that is tested for possible rejection under the assumption that it is true is
called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
Ans: a
Explanation:
3) A statement whose validity is tested on the basis of a sample is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
Ans: b
Explanation:
4) A hypothesis which defines the population distribution is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
Ans: c
Explanation:
5) If the null hypothesis is false then which of the following is accepted?
a) Null Hypothesis
b) Positive Hypothesis
c) Negative Hypothesis
d) Alternative Hypothesis.
Ans: d
Explanation:
6) The rejection probability of Null Hypothesis when it is true is called as?

a) Level of Confidence
b) Level of Significance
c) Level of Margin
d) Level of Rejection
Ans: b
Explanation:
7) The point where the Null Hypothesis gets rejected is called as?
a) Significant Value
b) Rejection Value
c) Acceptance Value
d) Critical Value
Ans: d
Explanation:
8) If the Critical region is evenly distributed then the test is referred as?
a) Two tailed
b) One tailed
c) Three tailed
d) Zero tailed
Ans: a
Explanation:
9) The type of test is defined by which of the following?
a) Null Hypothesis
b) Simple Hypothesis
c) Alternative Hypothesis
d) Composite Hypothesis
Ans: c
Explanation:
10) Which of the following is defined as the rule or formula to test a Null
Hypothesis?
a) Test statistic
b) Population statistic
c) Variance statistic
d) Null statistic
Ans: a
Explanation:
11) Type 1 error occurs when?
a) We reject H0 if it is True
b) We reject H0 if it is False
c) We accept H0 if it is True
d) We accept H0 if it is False
Ans: a
Explanation:
12) The probability of Type 1 error is referred as?
a) 1-α
b) β
c) α
d) 1-β
Ans: c

Explanation:
13) Alternative Hypothesis is also called as?
a) Composite hypothesis
b) Research Hypothesis
c) Simple Hypothesis
d) Null Hypothesis
Ans: b
Explanation:
14) Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned
Ans: d
Explanation:
15) Point out the wrong statement.
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is same as k-means
d) none of the mentioned
Ans: c
Explanation:
16) Hierarchical clustering should be primarily used for exploration.
a) True
b) False
Ans: a
Explanation:
17) Which of the following function is used for k-means clustering?
a) k-means
b) k-mean
c) heatmap
d) none of the mentioned
Ans: a
Explanation:
18) Which of the following clustering requires merging approach?
a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned
Ans: b
Explanation:
19) K-means is not deterministic and it also consists of number of iterations.

a) True
b) False
Ans: a
20) Depending on acceptance and rejection of null hypothesis there are 2 types of
error produced
a) Type 1
b) Type 2
c) None of these
d) All of these
Ans: d
21) The power of a test can be defined as the probability of …

a) Rejecting null hypothesis

b) Accepting null hypothesis
c) Increasing null hypothesis
d) Decreasing null hypothesis
Ans: a
22) For a fixed significance level, a greater sample size is mandatory to discover
a
a) Minor difference in mean
b) Major difference in mean
c) Average difference in mean
d) None of the above
Ans: a
23) ANOVA tests whether any of the population means differ from the other population
means
a) True
b) False
Ans: a
24) Clustering is defined as group of same kind of objects which are gathered by
use of
a) Unsupervised method
b) Supervised method
c) Semi supervised method
d) None of these
Ans: a
25) Following are the applications of Kmeans

a) Image Processing
b) Medical
c) Customer Segmentation
d) All of the above
Ans: d

Pravin S.Patil
Subject Teacher
Unit-I

1. Data in ___________ bytes size is called Big Data.

A. Tera
B. Giga
C. Peta
D. Meta

Ans : C
Explanation: data in Peta bytes i.e. 10^15 byte size is called Big Data.

2. How many V's of Big Data

A. 2
B. 3
C. 4
D. 5

Ans : D
Explanation: Big Data was defined by the “3Vs” but now there are “5Vs” of Big Data which are
Volume, Velocity, Variety, Veracity, Value

3. Transaction data of the bank is?

A. structured data
B. unstructured data
C. Both A and B
D. None of the above

Ans : A
Explanation: Data which can be saved in tables are structured data like the transaction data of the
bank.

4. In how many forms BigData could be found?

A. 2
B. 3
C. 4
D. 5
Ans : B
Explanation: BigData could be found in three forms: Structured, Unstructured and Semi-structured.

5. Which of the following are Benefits of Big Data Processing?

A. Businesses can utilize outside intelligence while taking decisions
B. Improved customer service
C. Better operational efficiency
D. All of the above

Ans : D
Explanation: All of the above are Benefits of Big Data Processing.

6. Which of the following is not a Big Data technology?

A. Apache Hadoop
B. Apache Spark
C. Apache Kafka
D. Apache Pytarch

Ans : D
Explanation: Apache Pytarch is not a Big Data technology.

7. What percentage of the world’s total data has been created within just the past two years?
A. 80%
B. 85%
C. 90%
D. 95%

Ans : C
Explanation: 90% of the world’s total data has been created within just the past
two years.

8) Which of the following step is performed by data scientist after acquiring the data?
a) Data Cleansing
b) Data Integration
c) Data Replication
d) All of the mentioned
Ans: Data Cleansing

9)3V’s are not sufficient to describe big data.

a) True
b) False
Ans: True

10. Communicative and collaborative is one among the key skill sets and behavioral characteristics of a
data scientist [True / False]?

a. True

b. False

Answer : a

11. ---------- are the sources of Bigdata [select all that apply]

I. Book
II. Facebook
III. Genome sequence
IV. Video Surveillance
Ans:

12. BI analyses past data and makes future predictions. True/False?
a. True

b. False

Answer : b

12. In which phase of data analytics is ETLT performed?

A. Discovery
B. Model Planning

C. Model Building

D. Data Preparation

Ans: D. ETLT is performed in Phase 2, data preparation. The team works in an analytic sandbox for
the entire duration of the project, exploring, preprocessing, and conditioning the data before
modeling. To get the data into the sandbox, the team performs ETLT (extract, transform, load,
and transform).

13. In which data analytics lifecycle phase is an analytic sandbox prepared?

A. Data Preparation

B. Model Planning

C. Model Building

D. Discovery

Ans: A. Phase 2, data preparation, requires the presence of an analytic sandbox, in which the team
can work with data and perform analytics for the duration of the project. The team needs to execute
extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox.

14. In which phase would the team expect to invest most of the project time?

A. Data Preparation

B. Model Planning

C. Model Building

D. Discovery

15. In which phase would the team expect to invest least time of the project time?

A. Data Preparation

B. Model Planning

C. Model Building

D. Discovery

16. Which of the following tools is used for model building?

a. Hadoop b. Octave c. OpenRefine d. All of the above

Ans: b

17. Which of the following tools is used for data preparation?
a. Alpine Miner b. Excel c. Matlab d. Weka

Ans: a

18. To determine if the project was completed on time and within budget, is the key role of _____

A. Project Sponsor

B. Project Manager

C. Data Engineer

D. Data Scientist

19. How many Phases are there in Data Analytics Lifecycle?

A. 3

B. 6

C. 7

D. Any

20. In data Analytics life cycle we can move back and refine the work done. True or False

A. True

B. False

21. What are the key outputs from Analytics Projects?

A. PPT

B.report

C. code

D. All of above

22. ________ provides subject matter expertise for analytical techniques and data modeling, and applies
valid analytical techniques to the given business problems.

A. Project Sponsor

B. Project Manager

C. Data Engineer

D. Data Scientist
Unit-II

1. A statement about a population developed for the purpose of testing is called:

(a) Hypothesis

(b) Hypothesis testing

(c) Level of significance

(d) Test-statistic

Answer : a

2. Any hypothesis which is tested for the purpose of rejection under the assumption that it is true is
called:

(a) Null hypothesis

(b) Alternative hypothesis

(c) Statistical hypothesis

(d) Composite hypothesis

Answer : a

3. A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is
false is called:

(a) Simple hypothesis

(b) Composite hypothesis

(c) Statistical hypothesis

(d) Alternative hypothesis

Answer : d

4. The alternative hypothesis is also called:

(a) Null hypothesis

(b) Statistical hypothesis

(c) Research hypothesis

(d) Simple hypothesis

Answer : c

5. The probability of rejecting the null hypothesis when it is true is called:

(a) Level of confidence

(b) Level of significance

(c) Power of the test

(d) Difficult to tell

Answer : b

6. If the critical region is located equally in both sides of the sampling distribution of test-statistic, the
test is called:

(a) One tailed

(b) Two tailed

(c) Right tailed

(d) Left tailed

Answer : b

7. The choice of one-tailed test and two-tailed test depends upon:

(a) Null hypothesis

(b) Alternative hypothesis

(c) None of these

(d) Composite hypotheses

Answer : b

8. Test of hypothesis Ho: µ = 50 against H1: µ > 50 leads to:

(a) Left-tailed test

(b) Right-tailed test

(c) Two-tailed test

(d) Difficult to tell

Answer : b

9. Testing Ho: µ = 25 against H1: µ ≠ 25 leads to:

(a) Two-tailed test

(b) Left-tailed test

(c) Right-tailed test

(d) None of (a), (b), or (c)

Answer : a
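The difference between the right-tailed and two-tailed tests in questions 8 and 9 shows up in where the critical value sits. The sketch below is only an illustration: it assumes a z-test (standard normal test statistic) and a significance level of 0.05, neither of which is stated in the questions.

```python
from statistics import NormalDist

alpha = 0.05  # assumed significance level, not given in the questions
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Right-tailed test (H1: mu > mu0): all of alpha sits in the upper tail.
right_tailed_critical = standard_normal.inv_cdf(1 - alpha)

# Two-tailed test (H1: mu != mu0): alpha is split equally between both tails.
two_tailed_critical = standard_normal.inv_cdf(1 - alpha / 2)

print(round(right_tailed_critical, 3))  # 1.645
print(round(two_tailed_critical, 3))    # 1.96
```

With the two-tailed alternative, H0 is rejected when the test statistic falls below -1.96 or above 1.96; with the right-tailed alternative, only values above 1.645 lead to rejection.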

10. A formula that provides a basis for testing a null hypothesis is called:

(a) Test-statistic

(b) Population statistic

(c) Both of these

(d) None of the above

Answer : a

11. 1 – α is also called:

(a) Confidence coefficient

(b) Power of the test

(c) Size of the test

(d) Level of significance

Answer : a

12. Area of the rejection region depends on:

(a) Size of α

(b) Size of β

(d) Number of values

Answer : a
13. Student’s t-test is applicable only when:

(a) n≤30 and σ is unknown

(b) n>30 and σ is unknown

(c) n=30 and σ is known

(d) All of the above

Answer : a

14. In an unpaired samples t-test with sample sizes n1= 11 and n2= 11, the value of tabulated t should be
obtained for:

(a) 10 degrees of freedom

(b) 21 degrees of freedom

(c) 22 degrees of freedom

(d) 20 degrees of freedom

Answer : d
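The degrees of freedom in question 14 follow from a one-line formula. This sketch assumes the pooled-variance (equal variances) form of the unpaired t-test, where the degrees of freedom are n1 + n2 - 2; the function name is illustrative.

```python
def unpaired_t_degrees_of_freedom(n1: int, n2: int) -> int:
    """Degrees of freedom for a pooled-variance two-sample t-test."""
    # One degree of freedom is lost for each estimated sample mean.
    return n1 + n2 - 2


df = unpaired_t_degrees_of_freedom(11, 11)
print(df)  # 20
```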

15. The purpose of statistical inference is:

(a) To collect sample data and use them to formulate hypotheses about a population

(b) To draw conclusion about populations and then collect sample data to support the conclusions

(c) To draw conclusions about populations from sample data

(d) To draw conclusions about the known value of population parameter

Answer : c

16. The histogram to the right represents the hospital length of stay (in days) for patients at a nearby
medical facility. How many patients are included in the histogram?

a. 5

b. 21

c. 17

d. 9
Answer : b

17. Using the histogram to the right that represents the hospital lengths of stay (in days) for patients at a
nearby medical facility, determine the relationship between the mean and the median.

a. Mean = Median

b. Mean ≈ Median

c. Mean < Median

d. Mean > Median

Answer : d

18. The statement “If there is sufficient evidence to reject a null hypothesis at the 10%

significance level, then there is sufficient evidence to reject it at the 5% significance level” :

Please select the best answer of those provided below.

a. Always True

b. Never True

c. Sometimes True; the p-value for the statistical test needs to be provided for a conclusion

d. Not Enough Information; this would depend on the type of statistical test used

Answer : c
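Why the statement in question 18 is only sometimes true can be seen by comparing a p-value against both significance levels. This is a minimal sketch; the two p-values are made up for illustration.

```python
def reject_null(p_value: float, alpha: float) -> bool:
    """Reject H0 when the p-value does not exceed the significance level."""
    return p_value <= alpha


# Hypothetical p-values: both reject at 10%, but only the first rejects at 5%.
results = {
    p_value: (reject_null(p_value, 0.10), reject_null(p_value, 0.05))
    for p_value in (0.03, 0.07)
}
for p_value, (at_10, at_5) in results.items():
    print(p_value, at_10, at_5)
# 0.03 True True
# 0.07 True False
```

A p-value of 0.07 is enough to reject at the 10% level but not at the 5% level, so the p-value is needed before drawing a conclusion.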

19.Analysis of variance in short form is?

a) ANOV

b) AVA

c) ANOVA

d) ANVA

Ans:c
20) Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned
Ans: defined distance metric, number of clusters, initial guess as to cluster centroids

21) Hierarchical clustering should be primarily used for exploration.

a) True
b) False
Ans: True

22) Which of the following function is used for k-means clustering?

a) k-means
b) k-mean
c) heatmap
d) none of the mentioned
Ans: k-means

23) The goal of clustering a set of data is to

a)divide them into groups of data that are near each other
b)choose the best data from the set
c)determine the nearest neighbors of each of the data
d)predict the class of data
Ans: divide them into groups of data that are near each other

24) The k-means algorithm...

a)always converges to a clustering that minimizes the mean-square
vector-representative distance
b)can converge to different final clustering, depending on initial choice of
representatives
c)is widely used in practice
d)is typically done by hand, using paper and pencil
e)should only be attempted by trained professionals
Ans: can converge to different final clustering, depending on initial choice of representatives, is
widely used in practice

25) Considering the K-means algorithm, after current iteration, we have 3 centroids (0, 1) (2, 1),
(-1, 2). Will points (2, 3) and (2, 0.5) be assigned to the same cluster in the next iteration?
a) Yes
b) No
Ans: Yes
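The answer to question 25 can be verified by running the k-means assignment step by hand. This sketch assumes Euclidean distance, the usual choice for k-means; the helper name `nearest_centroid` is illustrative.

```python
import math

centroids = [(0, 1), (2, 1), (-1, 2)]
points = [(2, 3), (2, 0.5)]


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


assignments = [nearest_centroid(p, centroids) for p in points]
print(assignments)  # [1, 1]
```

Both points are closest to the centroid (2, 1), at distances 2 and 0.5, so they land in the same cluster.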

26) What are the two types of Hierarchical Clustering?

a)Top-Down Clustering (Divisive)
b)Bottom-Top Clustering (Agglomerative)
c)Dendrogram
d)K-means
Ans: Top-Down Clustering (Divisive), Bottom-Top Clustering (Agglomerative)

27) The most commonly used measure of similarity is the _____ or its square.
a)euclidean distance
b)city-block distance
c)Chebychev’s distance
d)Manhattan distance
Ans: euclidean distance

29) Which of the following is required by K-means clustering?

a)defined distance metric
b)number of clusters
c)initial guess as to cluster centroids
Ans: defined distance metric, number of clusters, initial guess as to cluster centroids

30) Clustering is a-
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Ans: Unsupervised learning

31) Which of the following clustering algorithms suffers from the problem of convergence at local
optima?
A. K- Means clustering
B. Hierarchical clustering
C. Diverse clustering
D. All of the above
Ans: K- Means clustering

32) Which version of the clustering algorithm is most sensitive to outliers?

A. K-means clustering algorithm
B. K-modes clustering algorithm
C. K-medians clustering algorithm
D. None
Ans: K-means clustering algorithm

33) Which of the following is a bad characteristic of a dataset for clustering analysis-
A. Data points with outliers
B. Data points with different densities
C. Data points with non-convex shapes
D. All of the above
Ans: Data points with outliers, Data points with different densities, Data points with non-convex
shapes
34) For clustering, we do not require-
A. Labeled data
B. Unlabeled data
C. Numerical data
D. Categorical data
Ans: Labeled Data

35) Which of the following is an application of clustering?

A. Biological network analysis
B. Market trend prediction
C. Topic modeling
D. All of the above
Ans: Biological network analysis, Market trend prediction, Topic modeling

36) The final output of Hierarchical clustering is-

A. The number of cluster centroids
B. The tree representing how close the data points are to each other
C. A map defining the similar data points into individual groups
D. All of the above
Ans: The tree representing how close the data points are to each other

37. Which type of test is the Wilcoxon rank sum test?

a. Parametric

b. non parametric

c. Distributed

d. Normal

38. Input data for Wilcoxon test is normally distributed, True or False?

39. What is the null hypothesis for a Wilcoxon test?

a.Two group means are equal.

b.Two or more group means are equal.

c.Two mean groups are not equal.

d. None of these
40. Which of the following decision rules is used in the Wilcoxon Rank Sum Test?

a. test statistics <= critical value, Ho will be Rejected

b. if test statistics > critical value, Ho will be Rejected

c. if test statistics >critical value, Ho will be accepted

d. none of these.

40. What must you include when applying Wilcoxon Rank sum test?

a. variance

b. Critical Value

c. Rank sum

e. standard deviation
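The rank sum that the Wilcoxon questions refer to is computed by ranking the pooled observations and adding up the ranks per group. The sketch below uses made-up sample values and gives tied values their average rank, which is a common convention and an assumption here.

```python
def rank_sums(group_a, group_b):
    """Rank the pooled observations and return the sum of ranks for each group."""
    pooled = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        # Advance j past every observation tied with pooled[i].
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based ranks i+1 .. j; assign their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return (
        sum(rank_of[x] for x in group_a),
        sum(rank_of[x] for x in group_b),
    )


# Hypothetical samples, chosen only to illustrate the calculation.
sums = rank_sums([12, 15, 18], [20, 22, 25, 30])
print(sums)  # (6.0, 22.0)
```

The smaller rank sum is then compared against a tabulated critical value for the two sample sizes.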

Type 1 and Type 2 errors

41. Type 1 is also called as

a. False Positive

b. false negative

c. True Positive

d. True negative

42. Type 2 is also called as

a. False Positive

b. False negative

c. True Positive

d. True negative

43. Type 1 error occurs when_____

a. Null hypothesis rejected when it is true.

b. Null hypothesis is accepted when it is false
c. Null hypothesis rejected when it is false
d. None of Above

44. Type 2 error occurs when_____

a. Null hypothesis rejected when it is true.

b. Null hypothesis is accepted when it is false
c. Null hypothesis rejected when it is false
d. All of above

ANOVA

44. Analysis of Variance is statistical method of comparing____of several populations.

a. Means

b. variance

c. standard Deviation

d. None of above.

45. ANOVA is used when____

a. there are more than two populations

b. there are two populations

c. there are three populations

d. there is any number of populations

46. What is Null Hypothesis in ANOVA?

a. all group means are equal

b. Three group means are equal

c. atleast one pair of group means unequal.

d. all group means are unequal.

47. What do ANOVA calculate?

a. Z-score

b. F ratio

c. T-score

d. Chi Square
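The quantity that one-way ANOVA computes can be worked out from the between-group and within-group variability. This is a minimal sketch using made-up data for three groups; the function name `f_ratio` is illustrative.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F ratio: between-group mean square / within-group mean square."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of observations
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)    # degrees of freedom: k - 1
    ms_within = ss_within / (n - k)      # degrees of freedom: n - k
    return ms_between / ms_within


# Hypothetical samples with group means 2, 3 and 6.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = f_ratio(groups)
print(round(f_value, 2))  # 13.0
```

Here the between-group sum of squares is 26 (mean square 13) and the within-group sum of squares is 6 (mean square 1). A larger between-group mean square raises the F ratio, and H0 is rejected when the F ratio exceeds the critical value.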

Q.25 What are the two types of variance which can occur in your data?

a. Independent and Dependent

b. Between and within groups

c. Personal and interpersonal

d. Anova and Anoca

Q.26 If the between-group mean square increases, the value of the F statistic _____

a. Increases
b. Decreases
c. Neutral
d. None of these
Q.27 What must you include when applying ANOVA test?

a. Means

b. Critical Value

c. degree of freedom

d. F statistics

e. All of above

Q.28 How many dependent variables are there in a two-way ANOVA?

a.1

b.3

c.2
d.any

Q.29 Which of the following decision rules is used in ANOVA?

a.if critical value > F ratio, Ho will be Rejected

b.if critical value < F ratio, Ho will be Rejected

c.if critical value > F ratio, Ho will be accepted

d.None of these

Q.30 Various types of ANOVA are___.

a.Two way ANOVA

b.ANCOVA

c.MANOVA

d.ZANOVA
Unit-III

1.A collection of one or more items is called as _____

(A)Itemset

(B)Support

(C)Confidence

(D)Support Count

Ans:A

2.Frequency of occurrence of an itemset is called as _____

(A)Support

(B)Confidence

(C)Support Count

(D)Rules

Ans:C

3.An itemset whose support is greater than or equal to a minimum support threshold is ______

(A)Itemset

(B)Frequent Itemset

(C)Infrequent items

(D)Threshold values

Ans:B

4.What does FP growth algorithm do?

(A)It mines all frequent patterns through pruning rules with lesser support

(B)It mines all frequent patterns through pruning rules with higher support

(C)It mines all frequent patterns by constructing a FP tree

(D)It mines all frequent patterns by constructing an itemsets

Ans:C

5.What techniques can be used to improve the efficiency of apriori algorithm?

(A)Hash-based techniques

(B)Transaction Increases

(C)Sampling

(D)Cleaning

Ans:A

6. Linear Regression is a supervised machine learning algorithm.

A) TRUE

B) FALSE

Ans:A

7. It is possible to design a linear regression algorithm using a neural network.

A) TRUE

B) FALSE

Ans:A

8.Which of the following methods do we use to find the best fit line for data in Linear
Regression?

A) Least Square Error

B) Maximum Likelihood

C) Logarithmic Loss

D) Both A and B

Ans:A

9. A local retailer has a database that stores 10,000 transactions of last summer. After
analyzing the data, a data science team has identified the following statistics:
• {battery} appears in 6,000 transactions.
• {sunscreen} appears in 5,000 transactions.
• {sandals} appears in 4,000 transactions.
• {bowls} appears in 2,000 transactions.
• {battery, sunscreen} appears in 1,500 transactions.
• {battery, sandals} appears in 1,000 transactions.
• {battery, bowls} appears in 250 transactions.
• {battery, sunscreen, sandals} appears in 600 transactions.
Q) What are the confidence values of {battery} -> {sunscreen} and
{battery, sunscreen} -> {sandals}?
a) 0.3 and 0.4
b) 0.25 and 0.4
c) 0.25 and 0.15
d) 0.6 and 0.4
Ans: b
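The two confidence values in question 9 can be reproduced from the support counts given in the question. In this sketch, confidence(X -> Y) is taken as count(X and Y together) divided by count(X); the dictionary and function names are illustrative.

```python
# Support counts taken from the question (out of 10,000 transactions).
support_count = {
    frozenset({"battery"}): 6000,
    frozenset({"battery", "sunscreen"}): 1500,
    frozenset({"battery", "sunscreen", "sandals"}): 600,
}


def confidence(antecedent, consequent):
    """confidence(X -> Y) = count(X union Y) / count(X)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support_count[both] / support_count[frozenset(antecedent)]


conf_1 = confidence({"battery"}, {"sunscreen"})
conf_2 = confidence({"battery", "sunscreen"}, {"sandals"})
print(conf_1, conf_2)  # 0.25 0.4
```

That is 1,500 / 6,000 = 0.25 and 600 / 1,500 = 0.4, matching option b.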

10.Which of the following implies no relationship with respect to correlation?

a) Cor(X, Y) = 1

b) Cor(X, Y) = 0

c) Cor(X, Y) = 2

d) All of the mentioned

Ans:b

11. If a linear regression model fits perfectly, i.e., train error is zero, then
_____________________

a) Test error is also always zero

b) Test error is non zero

c) Couldn’t comment on Test error

d) Test error is equal to Train error

Ans:C

12.Which of the following metrics can be used for evaluating regression models?

i) R Squared

ii) Adjusted R Squared

iii) F Statistics

iv) RMSE / MSE / MAE

a) ii and iv

b) i and ii

c) ii, iii and iv

d) i, ii, iii and iv

Ans:d

13.How many coefficients do you need to estimate in a simple linear regression model (One
independent variable)?

a) 1

b) 2

c) 3

d) 4

Ans:b

14.In a simple linear regression model (One independent variable), If we change the input
variable by 1 unit. How much output variable will change?

a) by 1

b) no change

c) by intercept

d) by its slope

Ans:d

15.Function used for linear regression in R is __________

a) lm(formula, data)

b) lr(formula, data)

c) lrm(formula, data)

d) regression.linear(formula, data)

Ans:a
16.In syntax of linear model lm(formula,data,..), data refers to ______

a) Matrix

b) Vector

c) Array

d) List

Ans:b

17.In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to
__________

a) (X-intercept, Slope)

b) (Slope, X-Intercept)

c) (Y-Intercept, Slope)

d) (slope, Y-Intercept)

Ans:c
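The roles of β1 (Y-intercept) and β2 (slope) in Y = β1 + β2X can be seen by fitting a line with ordinary least squares. The sketch below uses made-up data that lie exactly on y = 1 + 2x; the function name `fit_line` is illustrative.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = b1 + b2 * x; returns (b1, b2)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b2 = sxy / sxx            # slope
    b1 = y_bar - b2 * x_bar   # Y-intercept
    return b1, b2


# Hypothetical data generated from y = 1 + 2x with no noise.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b1, b2 = fit_line(xs, ys)
print(b1, b2)  # 1.0 2.0

# Increasing x by one unit changes the prediction by exactly the slope.
change = (b1 + b2 * 3) - (b1 + b2 * 2)
print(change)  # 2.0
```

Two coefficients are estimated, the intercept and the slope, which is why a simple linear regression model needs two parameters.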

18. ________ is an incredibly powerful tool for analyzing data.

a) Linear regression

b) Logistic regression

c) Gradient Descent

d) Greedy algorithms

Ans:a

19. The square of the correlation coefficient, r^2, is never negative and is called the
________

a) Regression

b) Coefficient of determination
c) KNN

d) Algorithm

Ans:b
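
As a rough illustration of questions 10 and 19, the sketch below computes the Pearson correlation r and its square for made-up data. The helper name `pearson_r` is an assumption for this example. A perfectly decreasing line gives r = -1, yet r^2 is still non-negative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y falls by 2 for every unit increase in x.
xs = [1, 2, 3, 4, 5]
ys = [10, 8, 6, 4, 2]
r = pearson_r(xs, ys)
r_squared = r ** 2  # coefficient of determination for simple linear regression
print(r, r_squared)
```
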

20.Predicting y for a value of x that’s outside the range of values we actually saw for x in the
original data is called ___________

a) Regression

b) Extrapolation

c) Interpolation

d) Polation

Ans:b

21.What is predicting y for a value of x that is within the interval of points that we saw in the
original data called?

a) Regression

b) Extrapolation

c) Interpolation

d) Polation

Ans:c
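
The distinction in questions 20 and 21 comes down to whether the new x lies inside the range of x values seen in the original data. A small sketch, with a hypothetical helper name and made-up observations:

```python
def prediction_kind(x_new, observed_xs):
    """Classify a prediction by where x_new falls relative to the observed range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"

# Hypothetical x values seen when the model was fitted.
observed = [2.0, 3.5, 5.0, 8.0]
print(prediction_kind(4.0, observed))   # inside [2.0, 8.0]
print(prediction_kind(12.0, observed))  # outside [2.0, 8.0]
```
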

22. ________ is a simple approach to supervised learning. It assumes that the dependence of Y
on X1, X2, . . . Xp is linear.

a) Linear regression

b) Logistic regression

c) Gradient Descent

d) Greedy algorithms
Ans:a

23.Although it may seem overly simplistic, _______ is extremely useful both conceptually and
practically.

a) Linear regression

b) Logistic regression

c) Gradient Descent

d) Greedy algorithms

Ans:a

24. __________ refers to a group of techniques for fitting and studying the straight-line
relationship between two variables.

a) Linear regression

b) Logistic regression

c) Gradient Descent

d) Greedy algorithms

Ans:a

25. What do you mean by support(A)?

a. Total number of transactions containing A
b. Total Number of transactions not containing A
c. Number of transactions containing A / Total number of transactions
d. Number of transactions not containing A / Total number of transactions

Ans: c
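
The definition of support in question 25, and the confidence measure used in the association rule questions, can be sketched as follows. The transactions and function names are hypothetical, not data from the questions above.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    items = set(itemset)
    containing = sum(1 for t in transactions if items <= set(t))
    return containing / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

# Hypothetical market-basket data.
transactions = [
    {"battery", "sunscreen"},
    {"battery", "sandals"},
    {"battery", "sunscreen", "sandals"},
    {"sunscreen"},
    {"battery"},
]
battery_support = support({"battery"}, transactions)
rule_confidence = confidence({"battery"}, {"sunscreen"}, transactions)
print(battery_support, rule_confidence)
```
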
Data Processing and Analysis
Unit 4

Multiple Choice Questions with Answer Key

1. What is a hypothesis?

a. A statement that the researcher wants to test through the data
collected in a study.
b. A research question the results will answer.
c. A theory that underpins the study.
d. A statistical method for calculating the extent to which the results
could have happened by chance.

Answer: a

2. Qualitative data analysis is still a relatively new and rapidly
developing branch of research methodology.

a. True
b. False

Answer: a

3. The process of marking segments of data with symbols,
descriptive words, or category names is known as _______.

a. Concurring
b. Coding
c. Colouring
d. Segmenting

Answer: b
4. What is the cyclical process of collecting and analysing data
during a single research study called?

a. Interim analysis
b. Inter analysis
c. Inter-item analysis
d. Constant analysis

Answer: a

5. The process of quantifying data is referred to as _________.

a. Typology
b. Diagramming
c. Enumeration
d. Coding

Answer: c

6. An advantage of using computer programs for qualitative data is
that they _______.

a. Can reduce time required to analyse data (i.e., after the data are
transcribed)
b. Help in storing and organising data
c. Make many procedures available that are rarely done by hand
due to time constraints
d. All of the above

Answer: d
7. Boolean operators are words that are used to create logical
combinations.

a. True
b. False

Answer: a

8. __________ are the basic building blocks of qualitative data.

a. Categories
b. Units
c. Individuals
d. None of the above

Answer: a

9. This is the process of transforming qualitative research data from
written interviews or field notes into typed text.

a. Segmenting
b. Coding
c. Transcription
d. Mnemoning

Answer: c

10. A challenge of qualitative data analysis is that it often includes
data that are unwieldy and complex; it is a major challenge to make
sense of the large pool of data.
a. True
b. False

Answer: a
11. Hypothesis testing and estimation are both types of descriptive
statistics.

a. True
b. False

Answer: b

12. A set of data organised in a participants(rows)-by-variables(columns)
format is known as a “data set.”

a. True
b. False

Answer: a

13. A graph that uses vertical bars to represent data is called a ___

a. Line graph

b. Bar graph
c. Scatterplot
d. Vertical graph

Answer: b

14. ___________ are used when you want to visually examine the
relationship between two quantitative variables.

a. Bar graphs
b. Pie graphs
c. Line graphs
d. Scatterplots

Answer: d
15. The denominator (bottom) of the z-score formula is

a. The standard deviation

b. The difference between a score and the mean
c. The range
d. The mean

Answer: a
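
A short sketch of the z-score formula behind question 15, using made-up scores. The numerator is the difference between a score and the mean, and the denominator is the standard deviation (the population standard deviation is assumed here).

```python
import statistics

def z_score(value, data):
    """(value - mean) / standard deviation, using the population SD."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)  # denominator of the z-score formula
    return (value - mean) / sd

# Hypothetical scores with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(9, scores)
print(z)
```
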

16. Which of these distributions is used for testing a hypothesis?

a. Normal Distribution
b. Chi-Squared Distribution
c. Gamma Distribution
d. Poisson Distribution

Answer: b
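
One way the chi-squared distribution enters hypothesis testing is the goodness-of-fit test. The sketch below computes only the test statistic for hypothetical die-roll counts. The critical value 11.07 (5 degrees of freedom, 5% significance level) is taken from standard chi-squared tables and is hard-coded as an assumption.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 120 rolls of a die; a fair die expects 20 per face.
observed_counts = [18, 22, 20, 20, 25, 15]
expected_counts = [20] * 6
statistic = chi_square_statistic(observed_counts, expected_counts)

# Table value for df = 5 and alpha = 0.05 (assumed, not computed here).
CRITICAL_VALUE = 11.07
reject_null = statistic > CRITICAL_VALUE
print(statistic, reject_null)
```
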

17. A statement made about a population for testing purposes is
called?

a. Statistic
b. Hypothesis
c. Level of Significance
d. Test-Statistic

Answer: b
18. A hypothesis that is assumed to be true and is tested for possible
rejection is called?

a. Null Hypothesis
b. Statistical Hypothesis
c. Simple Hypothesis
d. Composite Hypothesis

Answer: a

19. If the null hypothesis is false then which of the following is
accepted?

a. Null Hypothesis
b. Positive Hypothesis
c. Negative Hypothesis
d. Alternative Hypothesis.

Answer: d

20. The alternative hypothesis is also called?

a. Composite hypothesis
b. Research Hypothesis
c. Simple Hypothesis
d. Null Hypothesis

Answer: b
Sr. No. | Marks | Question | A | B | C | D | Ans
0 | 1 | A group of 4 bits is also called? | Nibble | Byte | Kb | None | 4 bits make one nibble.
1 | 1 | There are how many types of Big Data: | 3 | 2 | 1 | None | Big Data is of 3 types.
2 | 1 | Which of the following are the V's of Big Data: | All | Volume | Variety | Velocity | This is an explanation.
3 | 1 | Which of these is not a characteristic of Big Data? | Storage | Volume | Variety | Velocity | This is an explanation.
4 | 2 | Which of the following is a drawback of Big Data: | Cost | Significant | Process | Fraud Detection | Big Data requires high cost to maintain a huge amount of data.
5 | 2 | Full form of GINA is: | Global Innovation Network and Analysis | Global Invention in Networks and Analytics | Globally Investment in Neurons and Analytics | None | GINA stands for Global Innovation Network and Analysis.
6 | 2 | Which is phase 3 in the Data Analytics Life Cycle? | Model Planning | Model Building | Data Preparation | Operationalize | Model Planning is the 3rd phase in the life cycle.
7 | 2 | GINA team thought to accomplish mainly ____ goals: | 3 | 2 | 1 | 5 | GINA targeted to achieve three goals for the project.
8 | 2 | The Data Preparation stage doesn't involve: | Analyzation | Collection | Cleansing | Processing | This is an explanation.
9 | 2 | Unstructured Data is further divided into how many types? | 2 | 3 | 4 | 5 | Unstructured data is divided into 2 types.
10 | 2 | The GINA team mainly used which software tool to analyze the data? | Tableau | Hadoop | HIVE | SQL | The team used Tableau to visualize the data.
11 | 2 | Which of the following is the first step of the Data Analytics Life Cycle: | Discovery | Data Preparation | Model Planning | Data Aware | This is an explanation.
12 | 2 | There are how many phases in the data analytics life cycle: | 6 | 5 | 4 | 7 | There are 6 stages in the data analytics life cycle.
13 | 2 | SEMMA Methodology has how many stages: | 5 | 4 | 6 | 7 | SEMMA methodology has five stages.
14 | 2 | Which phase of the Life Cycle requires collaboration with stakeholders? | Phase 5 | Phase 6 | Phase 4 | Phase 3 | Phase 5 involves collaboration with stakeholders.
15 | 2 | In Building a Model, how many phases are required: | 2 | 3 | 4 | 5 | This is an explanation.
16 | 2 | How much data in the whole world is structured: | 20% | 40% | 60% | 50% | Only 20% of the world's total data is structured.
17 | 2 | 10^21 bytes of memory is equal to: | 1 ZB | 1 TB | 1 YB | 1 XB | 10^21 B is equal to 1 ZB.
18 | 2 | Data Scientists in the GINA team used which technique on the textual description of the Innovation Roadmap Idea? | Natural Language Processing (NLP) | Hadoop | HIVE | SQL | NLP technique was used on the description of the Innovation Roadmap Idea.
19 | 2 | How many types of data analytics methodologies are there? | 2 | 4 | 3 | 6 | There are two types of data analytics methodologies: EDA and CDA.
20 | 3 | Other name for Bell Curve is: | Normal Distribution | Poisson Distribution | Binomial Distribution | Bernoulli Distribution | Bell Curve is also known as normal distribution.
21 | 3 | One of the most important tasks in big data analytics is: | Statistical Modeling | Testing of Data | Visualization | Operationalize | One of the most important tasks in big data analytics is statistical modeling.
22 | 3 | Some of the approaches considered for building the data analytics lifecycle framework best practices are: | All | CRISP-DM | SEMMA | MAD Skills | This is an explanation.
23 | 3 | In Phase 4, the team develops datasets for: | All | Testing of Data | Training of Data | Production purposes | This is an explanation.
24 | 3 | Full form of CRISP-DM Methodology is: | Cross Industry Standard Process for Data Mining | Common Industry Standard Program for Data Mining | Cross International Standard Process for Data Modeling | Company's Initial Standards Progress for Data Methods | CRISP-DM stands for Cross Industry Standard Process for Data Mining.
25 | 3 | SEMMA Methodology doesn't include which of the following stages: | Evaluate | Sample | Explore | Assess | This is an explanation.
26 | 3 | In which stage is the data monitored and analyzed to see if the generated model is creating the expected results? | Operationalize | Collection | Plan Model | Data Aware | In the last phase, i.e. Operationalize, data is monitored and analyzed to see if the generated model is creating the expected results.
27 | 3 | Data is captured in how many ways: | 3 | 4 | 5 | 6 | Data is captured in 3 main ways.
28 | 3 | In phase 2 of the Data Analytics Life Cycle, the team performs how many analytics to get the data in the sandbox? | 3 | 2 | 4 | 6 | The team performs ETL, ELT and ETLT in the 2nd phase of the cycle.
29 | 3 | The total area under the bell curve is ____ unit. | 1 | 2 | 3 | 4 | Area under the bell curve is 1 unit.
30 | 1 | Wilcoxon rank-sum test is also known as? | Mann-Whitney U test | Mean Difference | Alternative Hypothesis | Null Hypothesis | Wilcoxon rank-sum test is also called the Mann-Whitney U test.
31 | 1 | Which test is also known as T-test? | Hypothesis Test | Mean Difference | K-means test | None | This is an explanation.
32 | 1 | This equation is of which test? | Mean Difference | K-Means | Null Hypothesis | Alternative Hypothesis | This equation is of the Mean Difference test.
33 | 1 | A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called ___________. | One tailed test | Two-tailed test | Tailed test | Null test | A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called a one-tailed test.
34 | 1 | How many types of Statistical Hypothesis are there? | 2 | 3 | 4 | 6 | There are two types of Statistical Hypothesis.
35 | 1 | Analysis of Variance is also referred to as? | ANOVA | Mean Difference | Alternative Hypothesis | Null Hypothesis | ANOVA stands for Analysis of Variance.
36 | 1 | How many steps are involved in a Hypothesis Testing? | 4 | 2 | 3 | 5 | There are 4 steps in Hypothesis testing.
37 | 2 | The strength of evidence in support of a null hypothesis is measured by? | P-value | K-value | H-value | Null-value | The strength of evidence in support of a null hypothesis is measured by the P-value.
38 | 2 | Difference in means is also called? | Two sample t-test | T-test | M-test | Two sample test | Difference in means is also known as two sample t-test.
39 | 2 | The k-medoids is also called _______________ algorithm. | Partitioning Around Medoids (PAM) | Lloyd's Algorithm | Poisson's Algorithm | Regression | The k-medoids is also called the partitioning around medoids (PAM) algorithm.
40 | 2 | Clustering is an example of ____? | Unsupervised Learning | Supervised Learning | Classification | Regression | Clustering is an example of unsupervised learning.
41 | 2 | Which of the following is not an advantage of K means Clustering? | Requires a Priori | Fast | Robust | Easy to evaluate | This is an explanation.
42 | 2 | The probability of committing a Type II error is called | Beta | Alpha | Delta | Theta | The probability of committing a Type II error is called Beta.
43 | 2 | The ______ variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster. | Less | More | Variable | Fixed | The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
44 | 2 | Which hypothesis is usually the hypothesis that sample observations result purely from chance? | Null-Hypothesis | Mean Difference | K-means test | Alternative Hypothesis | Null Hypothesis is usually the hypothesis that sample observations result purely from chance.
45 | 2 | "Classical" ANOVA for balanced data does how many things at once? | 3 | 2 | 1 | 4 | "Classical" ANOVA for balanced data does three things at once.
46 | 2 | K-means clustering is used to solve which problems? | NP-hard problems | NP Problems | Hypothesis Problems | P problems | NP-hard problems are solved using K means clustering.
47 | 2 | The probability of committing a Type I error is called? | Alpha | Beta | Gamma | Delta | The probability of committing a Type I error is called alpha.
48 | 2 | K means Clustering is also known as? | Lloyd's Algorithm | Gaussian Algorithm | Poisson's Algorithm | None | K means clustering is also called Lloyd's algorithm.
49 | 3 | Which algorithm requires the user to specify the number of clusters k to be generated? | K-means clustering | Gaussian Algorithm | Alternative Hypothesis | Null Hypothesis | K-means clustering requires the user to specify the number of clusters k to be generated.
50 | 3 | K means clustering uses which approach to solve the problems? | Expectation-maximization | Greedy Approach | Divide and Conquer | None | The expectation-maximization technique is used by k means clustering.
51 | 3 | How many factors affect the power of a hypothesis test? | 3 | 2 | 1 | 4 | The power of a hypothesis test is affected by three factors.
52 | 3 | The law of total variance is also called? | Eve's Law | Laplace Law | Poisson's Algorithm | Regression | The law of total variance is also called Eve's law.
53 | 3 | K-Medoids use which approach to solve problems? | Greedy Approach | Divide and Conquer | Recursive | None | K-Medoids use a greedy approach to solve problems.
54 | 3 | The time complexity of k means clustering is? | O(n^2) | O(nlogn) | O(n) | O(1) | Time complexity of k means clustering is O(n^2).
55 | 3 | The number (k) of clusters assumed in k-medoids is known as? | Priori | Null Hypothesis | ANOVA | Effect size | The number k of clusters is assumed known a priori.
56 | 3 | What is the difference between the true value and the value specified in the null hypothesis? | Effect size | Null Hypothesis | Alternative Hypothesis | ANOVA | The effect size is the difference between the true value and the value specified in the null hypothesis.
57 | 3 | Time complexity of k medoids is? | O(n^2) | O(nlogn) | O(n) | O(n^3) | This is an explanation.
58 | 3 | Which algorithm aims at minimizing an objective function known as the squared error function? | K-means | Mean Difference | Alternative Hypothesis | ANOVA | K means algorithm aims at minimizing an objective function known as the squared error function.
59 | 1 | Which algorithm was the earliest of the association rule algorithms? | Apriori Algorithm | Gaussian Algorithm | K means clustering | Bernoulli Distribution | Apriori Algorithm was the earliest of the association rule algorithms.
60 | 1 | The Apriori algorithm takes a ______ iterative approach to uncovering the frequent itemsets by first determining all the possible items. | Bottom-Up | Top-Down | Recursive | None | The Apriori algorithm takes a bottom-up iterative approach to uncovering the frequent itemsets by first determining all the possible items.
61 | 1 | Apriori uses which structure to count candidate item sets efficiently? | BFS | DFS | Queue | Stack | Apriori uses breadth-first search and a hash tree structure to count candidate item sets efficiently.
62 | 1 | "y=a+b*x^2". This equation shows which regression? | Polynomial Regression | Logistic Regression | Linear Regression | Lasso Regression | This is an explanation.
63 | 2 | __________ is defined as the measure of certainty or trustworthiness associated with each discovered rule. | Confidence | Recursion | Item-set | None | Confidence is defined as the measure of certainty or trustworthiness associated with each discovered rule.
64 | 2 | In which Regression do we predict the value by 1 or 0? | Logistic Regression | Linear Regression | Both | None | In Logistic Regression, we predict the value by 1 or 0.
65 | 2 | The formula for linear regression is: | Y’ = bX + A | Y’ = bX - A | Y’ = bX / A | Y’ = bX * A | The formula for linear regression is: Y’ = bX + A.
66 | 2 | Which regression is useful when there are a large number of independent variables? | Partial Least Squares (PLS) Regression | Cox Regression | Lasso Regression | Logistic Regression | PLS regression is also useful when there are a large number of independent variables.
67 | 2 | Which regression is an approach for predicting a response using a single feature? | Linear Regression | Logistic Regression | ElasticNet Regression | None | Simple linear regression is an approach for predicting a response using a single feature.
68 | 2 | Association rule mining consists of _______ steps. | 2 | 3 | 4 | 5 | Association rule mining consists of 2 steps.
69 | 2 | Which type of regression is suitable when the dependent variable is ordinal in nature? | Ordinal Regression | Linear Regression | Cox Regression | Logistic Regression | Ordinal regression is suitable when the dependent variable is ordinal in nature.
70 | 2 | Which regression is used for support vector machines? | ElasticNet Regression | Linear Regression | Logistic Regression | None | ElasticNet regression is used for support vector machines.
71 | 2 | Which regression can solve both linear and non-linear models? | Support Vector Regression | Linear Regression | Logistic Regression | ElasticNet Regression | Support Vector Regression can solve both linear and non-linear models.
72 | 2 | Which is the most common method used for fitting a regression line? | Least Square Method | Mean Difference | Null Hypothesis | Classification | Least Square Method is the most common method used for fitting a regression line.
73 | 2 | _______ problems are when the output variable is a real or continuous value. | Regression | Classification | Recursive | Hypothesis | A regression problem is when the output variable is a real or continuous value.
74 | 2 | Linear Regression is a machine learning algorithm based on ______ learning regression model. | Supervised Learning | Unsupervised Learning | Recursive Learning | All | Linear Regression is a machine learning algorithm based on supervised learning.
75 | 2 | When the dependent variable's variability is not equal across values of an independent variable, it is called | Heteroscedasticity | Homoscedasticity | Multicollinearity | Outliers | When the dependent variable's variability is not equal across values of an independent variable, it is called heteroscedasticity.
76 | 2 | _________ requires large sample sizes because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares. | Logistic Regression | Linear Regression | Lasso Regression | ElasticNet Regression | Logistic Regression requires large sample sizes because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares.
77 | 2 | PCR Regression is divided into how many steps? | 2 | 3 | 4 | 5 | PCR regression is divided into 2 steps.
78 | 3 | L2 regularization is also called? | Tikhonov Regularization | Norm Regularization | Poisson's Regularization | None | This is an explanation.
79 | 3 | When the variance of count data is greater than the mean count, it is a case of? | Overdispersion | Underdispersion | Dispersion | High dispersion | When the variance of count data is greater than the mean count, it is a case of overdispersion.
80 | 3 | Which regression assumes the normal distribution of the dependent variable? | Linear Regression | Logistic Regression | ElasticNet Regression | None | Linear regression assumes the normal or Gaussian distribution of the dependent variable.
81 | 3 | Nature of predicted data in regression is? | Ordered | Unordered | Both | None | Nature of predicted data in regression is ordered.
82 | 3 | Which regression uses a binary dependent variable but ignores the timing of events? | Logistic Regression | Linear Regression | Cox Regression | Lasso Regression | Logistic regression uses a binary dependent variable but ignores the timing of events.
83 | 3 | The Ridge Regression is also known as? | Shrinkage Regression | Percentile Regression | ElasticNet Regression | Lasso Regression | The ridge regression is also known as Shrinkage Regression.
84 | 3 | In which regression do we calculate Root Mean Square Error (RMSE) to predict the next weight value? | Linear Regression | ElasticNet Regression | Logistic Regression | All | In Linear Regression we calculate Root Mean Square Error (RMSE) to predict the next weight value.
85 | 3 | The ______ is the standard deviation of the observed residuals. | Residual standard error | Mean Difference Error | Data Error | All | The residual standard error is the standard deviation of the observed residuals.
86 | 3 | Which Regression is used when the dependent variable has count data? | Poisson Regression | Linear Regression | Cox Regression | Lasso Regression | Poisson regression is used when the dependent variable has count data.
87 | 3 | ________________ regression can handle both over-dispersion and under-dispersion. | Quasi-Poisson regression | Cox Regression | ElasticNet Regression | Linear Regression | Quasi-Poisson regression can handle both over-dispersion and under-dispersion.
88 | 3 | ___ is the regularization parameter in Lasso Regression? | λ | θ | Ω | β | λ is the regularization parameter in lasso regression.
89 | 1 | Decision Tree is a hierarchical model that does the separation of the input space into class regions using: | Recursion | Pointers | Greedy Approach | Divide and Conquer | Decision Tree is a hierarchical model that recursively does the separation of the input space into class regions.
90 | 1 | Learning Algorithm of Decision Tree is: | Greedy Approach | Divide and Conquer | Both | None | Decision Tree uses a greedy approach for its learning algorithm.
91 | 1 | Normal Distribution is also called? | Gaussian Distribution | Bernoulli Distribution | Naïve Bayes | Binary Distribution | This is an explanation.
92 | 1 | Classification has how many phases: | 2 | 3 | 4 | 5 | There are 2 phases of classification.
93 | 1 | "Every pair of features being classified is independent of each other". This principle is used by: | Naïve Bayes Classifier | Decision Tree | Bernoulli Distribution | Normal Distribution | Naïve Bayes uses the principle that every pair of features being classified is independent of each other.
94 | 2 | This equation is of which theorem? | Gaussian Distribution | Binary Distribution | Naïve Bayes | Cross-Entropy | This is an explanation.
95 | 2 | In Naïve Bayes, the datasets are divided into how many types? | 2 | 3 | 4 | 5 | Data sets are divided into two types in Naïve Bayes.
96 | 2 | Decision trees that can be used to predict non-categorical values are called? | Regression Trees | Categorical trees | Normal tree | None | Decision trees that can be used to predict non-categorical values are called regression trees.
97 | 2 | An attribute with ____ Gini index should be preferred in a decision tree. | Lower | Higher | Recursive | Negative | An attribute with lower Gini index should be preferred.
98 | 2 | In Naïve Bayes, if any two events A and B are independent, then, | P(A,B)=P(A)P(B) | P(A,B)=P(A)/P(B) | P(A,B)=P(B) | P(A,B)=P(B)/P(A) | If any two events A and B are independent, then P(A,B)=P(A)P(B).
99 | 2 | What is the measure of uncertainty of a random variable in a decision tree? | Entropy | Gain | Gini Index | None | Entropy is the measure of uncertainty of a random variable.
100 | 2 | Which of the following is not true for decision trees? | Stable | Easy to understand | Easy to explain | Easy to evaluate | This is an explanation.
101 | 2 | Decision tree algorithm falls under the category of which learning? | Supervised | Unsupervised | Regression | Classification | Decision tree algorithm falls under the category of supervised learning.
102 | 2 | False Positives and False Negatives is an application of which theorem? | Bayes' Theorem | Binary Distribution | Bernoulli Distribution | Normal Distribution | One of the uses of Bayes' Theorem is false positives and false negatives.
103 | 2 | Decision Trees used in mining the data are of how many types? | 2 | 3 | 4 | 5 | There are 2 types of decision trees used in data mining.
104 | 3 | In Bayes' Theorem, P(A) and P(B) are the probabilities of observing A and B respectively; they are known as: | Marginal Probability | Normal Distribution | Bernoulli Distribution | Parallel Algorithm | P(A) and P(B) are the probabilities of observing A and B respectively; they are known as the marginal probability.
105 | 3 | ID3 Algorithm in a decision tree stands for? | Iterative Dichotomiser 3 (ID3) | Interval Driven | Interconnected Decision | None | ID3 stands for Iterative Dichotomiser 3 (ID3).
106 | 3 | Probably the best way of estimating performance for very small data sets is: | Bootstrapped Method | Normal Distribution | Naïve Bayes | Binary Distribution | Probably the best way of estimating performance for very small data sets is the bootstrapped method.
107 | 3 | The Decision Tree works on which form? | Disjunctive Normal Form | Product of Sum | Bijective Form | Conjunctive Form | Decision Tree works on Disjunctive normal form.
108 | 3 | The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a ________ distribution. | 1-D | 2-D | 3-D | None | The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution.
109 | 3 | Theoretical concept to evaluate Classifiers is: | COLT | PAC Model | Naïve Bayes | Prediction | This is an explanation.
110 | 3 | ____________ is a metric to measure how often a randomly chosen element would be incorrectly identified. | Gini Index | Entropy | Pointer | Cross-Entropy | Gini Index is a metric to measure how often a randomly chosen element would be incorrectly identified.
111 | 3 | The most notable types of decision tree algorithms are: | 3 | 2 | 1 | 4 | The most notable types of decision tree algorithms are 3.
112 | 3 | Which process is completed when the subset at a node all has the same value of the target variable? | Recursive Partitioning | Termination | Transformation | Prediction | The recursive partition is completed when the subset at a node all has the same value of the target variable.
113 | 3 | The _______ method reserves a certain amount for testing and uses the remainder for training. | Holdout | Parallel Algorithm | Naïve Bayes | Normal Distribution | The holdout method reserves a certain amount for testing and uses the remainder for training.
114 | 3 | This equation is of which theorem? | Bayes' Theorem | Normal Distribution | Bernoulli Distribution | Cross-Entropy | This is an explanation.
115 | 3 | "Independence among the features". This is an assumption in: | Naïve Bayes Classifier | Bernoulli Distribution | Parallel Algorithm | Binary Distribution | Independence among the features is an assumption in Naïve Bayes.
116 | 3 | Error rate obtained from training data is called: | Resubstitution Error | Grid | Gini Index | True error | Error rate obtained from training data is called resubstitution error.
117 | 3 | In Decision Tree entropy is __________ to content. | proportional | inverse | High | Less | This is an explanation.
118 | 3 | In Decision Tree, no root-to-leaf path should contain the same discrete attribute ____________. | Twice | Once | Thrice | Four Times | No root-to-leaf path should contain the same discrete attribute twice.
119 | 1 | Using _________, designers can make information understandable for stakeholders. | Data Visualization | Classification | Regression | Supervised Learning | Using data visualization methods, designers can make information understandable for stakeholders.
120 | 1 | The additional visual methods include: | All | Tree Map | Parallel Coordinates | Semantic Networks | This is an explanation.
121 | 1 | Data Visualization tools don't include: | MS Excel | Tableau | Power BI | Jupyter | This is an explanation.
122 | 1 | Which of the following requires JavaScript knowledge to run the visualization tool? | All | Chart.js | Polymap | Sigmajs | This is an explanation.
123 | 1 | Merits of Tableau don't include which factor: | Cost | Performance | Usage | Computation | Merits of Tableau don't include the cost factor.
124 | 1 | Which of these is not a type of Big Data Visualization? | Pictograph | Bar-Graph | Line-Chart | Pie-Chart | This is an explanation.
125 | 2 | The drag-and-drop editor of which tool makes it easy to create professional-looking designs without a lot of visual design skill? | Infogram | Google Chart | Tableau | Grafana | The drag-and-drop editor of Infogram makes it easy to create professional-looking designs without a lot of visual design skill.
126 | 2 | How many V's are defined for Data Visualization? | 4 | 6 | 2 | 3 | There are 4 V's of Data visualization.
127 | 2 | Which of the following is not a free Data Visualization tool? | Tableau | Google Chart | Jupyter | Hub-Spot CRM | Tableau is a chargeable tool of data visualization.
128 | 2 | Companies that work with both traditional and big data use which technique to look at customer segments or market shares? | Pie-Chart | Bar-Graph | Stream graph | Line-Chart | Companies that work with both traditional and big data may use a pie chart to look at customer segments or market shares.
129 | 2 | Visualization of Data includes which of the following problems: | All | Information Loss | Visual Noise | Large Image Perception | This is an explanation.
130 | 2 | Mainly, Data Visualization has how many types of challenges? | 5 | 6 | 4 | 2 | There are 5 main challenges to data visualization.
131 | 2 | Which tool uses HTML5/SVG to visualize data? | Google Charts | Jupyter | Grafana | Tableau | Google Charts uses HTML5/SVG since it's browser compatible.
132 | 2 | According to Colin Ware’s Information Visualization: Perception for Design, he defines _____ pre-attentive visual properties. | 4 | 2 | 1 | 3 | According to Colin Ware’s Information Visualization: Perception for Design, he defines four pre-attentive visual properties.
133 | 2 | _____ is based on space-filling visualization of hierarchical data. | Tree-Map | Stream graph | Bar-graph | Line-Chart | Tree map method is based on space-filling visualization of hierarchical data.
134 | 2 | Which graph shows the dependency relationships between activities and current schedule status? | Gantt-Chart | Line-Chart | Pie-Chart | Bar-Graph | Gantt chart shows the dependency relationships between activities and current schedule status.
135 | 2 | Another name for distribution free data is: | Non parametric data | Parametric Data | Static data | Dynamic data | Non parametric data is also called distribution free data.
136 | 2 | Which chart is used for comparison of values, such as sales performance for several persons or businesses in a single time? | Bar-Graph | Gantt-Graph | Line-Chart | Pie-Chart | Bar Graph is used for comparison of values, such as sales performance for several persons or businesses in a single time.
137 | 2 | _____________ are graphics in the field of statistics used to visualize quantitative data. | Graphical-Techniques | Line-Chart | Regression | Classification | Graphical Techniques are graphics in the field of statistics used to visualize quantitative data.
138 | 2 | _____ can handle several factors for a large number of objects per single screen, so it satisfies the data variety criterion. | Parallel Coordinates | Stream graph | Google Chart | Jupyter | Parallel Coordinates can handle several factors for a large number of objects per single screen, so it satisfies the data variety criterion.
139 | 3 | Chart.js provides how many types of charts? | 8 | 5 | 3 | 6 | This is an explanation.
140 | 3 | Which visualization tool supports mixed data sources, annotations, and customizable alert functions, and it can be extended via hundreds of available plugins? | Grafana | Tableau | Google Chart | Jupyter | Grafana supports mixed data sources, annotations, and customizable alert functions, and it can be extended via hundreds of available plugins.
141 | 3 | Which tool was created specifically for adding charts and maps to news stories? | Datawrapper | Tableau | Google Chart | Jupyter | Datawrapper was created specifically for adding charts and maps to news stories.
142 | 3 | Conventional Visualization methods don't include: | Mekko Chart | Pie-Chart | Bar-graph | Histogram | Mekko chart is a new technique to visualize data.
143 | 3 | _____________ is a type of a stacked area graph, which is displaced around a central axis, resulting in flowing and organic shape. | Streamgraph | Bar-Graph | Pie-Chart | Line-Chart | Streamgraph is a type of a stacked area graph, which is displaced around a central axis, resulting in flowing and organic shape.
144 | 3 | Which visual tool includes over 150 chart types and 1,000 map types? | Fusion charts | Tableau | Google Chart | Jupyter | Fusion charts includes over 150 chart types and 1,000 map types.
145 | 3 | Which graph/chart is a graphical representation of logical relationship between different concepts? It generates a directed graph, the combination of nodes or vertices, edges or arcs, and label over each edge. | Semantic Networks | Bar-Graph | Pie-Chart | Line-Chart | A semantic network is a graphical representation of logical relationship between different concepts. It generates a directed graph, the combination of nodes or vertices, edges or arcs, and label over each edge.
146 | 3 | According to SAS we can process only ______ of information per second on a flat screen. | 1 Kilobit | 1 Byte | 1 Bit | 1 MB | According to SAS we can process only 1 kilobit of information per second on a flat screen.
147 | 3 | There are ____ steps for interactive data visualization: | 4 | 5 | 3 | 6 | This is an explanation.
148 | 3 | When working with big data, companies can use which visualization technique to track total application clicks by weeks, the average number of complaints to the call center by months, etc.? | Line-Chart | Bar-Graph | Pie-Chart | Stream graph | When working with big data, companies can use the line chart visualization technique to track total application clicks by weeks, the average number of complaints to the call center by months, etc.
149 | 1 | Which of the following Enterprises use HBase? | All | Facebook | Netflix | Adobe | This is an explanation.
marks question A B C D ans
150 (1 mark) Which NLP is used in the present era?
A - Neural NLP   B - Symbolic NLP   C - Statistical NLP   D - None
Ans: From 2010, Neural NLP is being used.

151 (1 mark) The Computer World magazine states that unstructured information might account for more than ______ of all data in organizations.
A - 70-80%   B - 90%   C - 50%   D - 60%
Ans: The Computer World magazine states that unstructured information might account for more than 70-80% of all data in organizations.

152 (1 mark) Almost all of the information we use and share every day, such as articles, documents and e-mails, is completely ___________.
A - Unstructured   B - Structured   C - Semantic   D - None
Ans: Almost all of the information we use and share every day, such as articles, documents and e-mails, is completely or partly unstructured.

153 (1 mark) Which standard provided a common framework for processing information to extract meaning and create structured data about the information?
A - Unstructured Information Management Architecture (UIMA)   B - Management Architecture for Data   C - Data Architecture   D - None
Ans: The Unstructured Information Management Architecture (UIMA) standard provided a common framework for processing this information to extract meaning and create structured data about the information.
154 (2 marks) The base Apache Hadoop framework is composed of how many modules?
A - 4   B - 2   C - 3   D - 6
Ans: The base Apache Hadoop framework is composed of four modules.

155 (2 marks) NoSQL doesn't include which software?
A - MS-SQL   B - HBASE   C - DynamoDB   D - MongoDB
Ans: This is an explanation.

156 (2 marks) There are _______ main types of OLAP systems.
A - 3   B - 2   C - 5   D - 6
Ans: There are 3 types of OLAP systems.

157 (2 marks) The SQL alternative in Apache Hive is called?
A - HIVEQL   B - BASEQL   C - SPARK-QL   D - H-QL
Ans: HiveQL is the alternative to SQL in the Apache Hive family.

158 (2 marks) A MapReduce program executes in how many stages?
A - 3   B - 2   C - 5   D - 4
Ans: A MapReduce program executes in three stages.

159 (2 marks) How many types of NoSQL database are there?
A - 4   B - 3   C - 2   D - 6
Ans: There are 4 types of NoSQL databases.

160 (2 marks) MapReduce is a processing technique and a program model for distributed computing based on which programming language?
A - JAVA   B - Python   C - C++   D - R
Ans: MapReduce is a processing technique and a program model for distributed computing based on Java.

161 (2 marks) Hive supports how many properties of transactions?
A - 4   B - 3   C - 2   D - 1
Ans: Hive supports all four properties of transactions.
162 (2 marks) HDFS consists of only one Name Node, which is called the?
A - Master Node   B - Slave Node   C - Both   D - None
Ans: HDFS consists of only one Name Node, which is called the Master Node.

163 (2 marks) Which Apache software is needed to process massive amounts of data for the purposes of natural-language search?
A - Apache HBASE   B - Apache Spark   C - Apache-PIG   D - Apache-mahout
Ans: HBase is used to process massive amounts of data for the purposes of natural-language search.

164 (2 marks) Which databases store data in a format other than relational tables?
A - NO-SQL   B - HIVESQL   C - SPARK-QL   D - H-QL
Ans: NoSQL databases store data in a format other than relational tables.

165 (2 marks) Which is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra?
A - Apache Mahout   B - Apache Spark   C - Apache-PIG   D - Apache HBASE
Ans: Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms focused primarily on linear algebra.

166 (2 marks) Which model is a specialization of the split-apply-combine strategy for data analysis?
A - MapReduce   B - Hadoop   C - HBASE   D - HIVE
Ans: The MapReduce model is a specialization of the split-apply-combine strategy for data analysis.

167 (2 marks) All Hadoop commands are invoked by which command?
A - $HADOOP_HOME/bin/hadoop   B - $HADOOP/bin/hadoop   C - $HADOOP_HOME/hadoop   D - $HADOOP_HOME/bin
Ans: All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.

168 (3 marks) The table typically enforces the schema when the data is loaded into the table. This enables the database to make sure that the data entered follows the representation of the table as specified by the table definition. This design is called?
A - Schema on Write   B - Schema on Read   C - Schema for Read Write   D - None
Ans: The table typically enforces the schema when the data is loaded into the table. This enables the database to make sure that the data entered follows the representation of the table as specified by the table definition. This design is called schema on write.
169 (3 marks) Which command formats the DFS filesystem?
A - Namenode -format   B - Node -format   C - Name -format   D - Format
Ans: The Namenode -format command formats the DFS file system.

170 (3 marks) Which command applies the offline fsimage viewer to an fsimage?
A - oiv   B - fs   C - fc   D - ov
Ans: oiv applies the offline fsimage viewer to an fsimage.

171 (3 marks) Hadoop requires which Java Runtime Environment (JRE) version or higher?
A - 1.6   B - 1.2   C - 1.5   D - 1
Ans: Hadoop requires Java Runtime Environment (JRE) 1.6 or higher.

172 (3 marks) Every Data node sends a Heartbeat message to the Name node every ____ seconds and conveys that it is alive.
A - 3   B - 2   C - 4   D - 1
Ans: Every Data node sends a Heartbeat message to the Name node every 3 seconds and conveys that it is alive.

173 (3 marks) HDFS can store files up to:
A - 1 TB   B - 1 GB   C - 1 ZB   D - 1 PB
Ans: HDFS can store files of up to 1 TB.

174 (3 marks) Which of the following is a wide-column store?
A - HBase   B - SQL   C - DynamoDB   D - MongoDB
Ans: HBase is a popular wide-column store.

175 (3 marks) Which node acts as both a DataNode and TaskTracker in Hadoop?
A - Slave Node   B - Data Node   C - Admin Node   D - Name Node
Ans: A slave or worker node acts as both a DataNode and TaskTracker.

176 (3 marks) The HDFS system uses which protocol for communication?
A - TCP/IP   B - TCP   C - UDP   D - IP
Ans: The HDFS system uses TCP/IP sockets for communication.

177 (3 marks) HDFS has how many services?
A - 5   B - 4   C - 2   D - 6
Ans: HDFS has five services.

178 (3 marks) ____________ is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.
A - Apache HIVE   B - Apache Spark   C - Apache-PIG   D - Apache HBASE
Ans: Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis.
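
Questions 158 and 166 above describe MapReduce as a three-stage specialization of the split-apply-combine strategy. The following is a minimal single-process sketch of those stages in Python, assuming a word-count job. The function names (map_stage, shuffle_stage, reduce_stage, word_count) are illustrative only and are not part of Hadoop's Java API; a real job would run the stages in parallel across the cluster.

```python
from collections import defaultdict


def map_stage(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_stage(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_stage(groups):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    """Chain the three stages: map, then shuffle, then reduce."""
    return reduce_stage(shuffle_stage(map_stage(lines)))


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

The map step is the "split and apply" part (each line is processed independently), and the shuffle and reduce steps together form the "combine" part.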
Hadoop Online Quiz - Tutorialspoint https://www.tutorialspoint.com/hadoop/hadoop_online_quiz.htm

HADOOP MOCK TEST
http://www.tutorialspoint.com Copyright © tutorialspoint.com

This section presents various sets of mock tests related to the Hadoop framework. You can
download these sample mock tests to your local machine and solve them offline at your convenience.
Every mock test is supplied with a mock test key to let you verify the final score and grade yourself.

HADOOP MOCK TEST I

Q 1 - The concept of using multiple machines to process data stored in a distributed system
is not new. High-performance computing (HPC) uses many computing machines to process
large volumes of data stored in a storage area network (SAN). As compared to HPC, Hadoop

A - Can process a larger volume of data.

B - Can run on a larger number of machines than HPC cluster.

C - Can process data faster under the same network bandwidth as compared to HPC.

D - Cannot run compute intensive jobs.

Q 2 - Hadoop differs from volunteer computing in

A - Volunteers donating CPU time and not network bandwidth.

B - Volunteers donating network bandwidth and not CPU time.

C - Hadoop cannot search for large prime numbers.

D - Only Hadoop can use mapreduce.

Q 3 - As compared to RDBMS, Hadoop

A - Has higher data Integrity.

B - Does ACID transactions

C - Is suitable for read and write many times

D - Works better on unstructured and semi-structured data.

Q 4 - What is the main problem faced while reading and writing data in parallel from
multiple disks?

A - Processing high volume of data faster.

B - Combining data from multiple disks.

C - The software required to do this task is extremely costly.

D - The hardware required to do this task is extremely costly.

Q 5 - Which of the following is true for disk drives over a period of time?

A - Data Seek time is improving faster than data transfer rate.

B - Data Seek time is improving more slowly than data transfer rate.

C - Data Seek time and data transfer rate are both increasing proportionately.

D - Only the storage capacity is increasing without increase in data transfer rate.

Q 6 - Data locality feature in Hadoop means

A - store the same data across multiple nodes.

B - relocate the data from one node to another.

C - co-locate the data with the computing nodes.

D - Distribute the data across multiple nodes.

Q 7 - Which of these provides a Stream processing system used in Hadoop ecosystem?

A - Solr

B - Tez

C - Spark

D - Hive

Q 8 - HDFS files are designed for

A - Multiple writers and modifications at arbitrary offsets.

B - Only append at the end of file

C - Writing into a file only once.

D - Low latency data access.

Q 9 - A file in HDFS that is smaller than a single block size

A - Cannot be stored in HDFS.

B - Occupies the full block's size.

C - Occupies only the size it needs and not the full block.

D - Can span over multiple blocks.

Q 10 - HDFS block size is larger as compared to the size of the disk blocks so that

A - Only HDFS files can be stored in the disk used.

B - The seek time is maximum

C - Transfer of large files made of multiple disk blocks is not possible.

D - A single file larger than the disk size can be stored across many disks in the cluster.

Q 11 - In a Hadoop cluster, what is true for a HDFS block that is no longer available
due to disk corruption or machine failure?

A - It is lost forever

B - It can be replicated from its alternative locations to other live machines.

C - The namenode allows new client request to keep trying to read it.

D - The Mapreduce job process runs ignoring the block and the data stored in it.

Q 12 - Which utility is used for checking the health of a HDFS file system?

A - fchk

B - fsck

C - fsch

D - fcks

Q 13 - Which command lists the blocks that make up each file in the filesystem?

A - hdfs fsck / -files -blocks

B - hdfs fsck / -blocks -files

C - hdfs fchk / -blocks -files

D - hdfs fchk / -files -blocks

Q 14 - The datanode and namenode are respectively

A - Master and worker nodes

B - Worker and Master nodes

C - Both are worker nodes

D - None

Q 15 - In the local disk of the namenode the files which are stored persistently are −

A - namespace image and edit log

B - block locations and namespace image

C - edit log and block locations

D - Namespace image, edit log and block locations.

Q 16 - When a client communicates with the HDFS file system, it needs to communicate with

A - only the namenode

B - only the data node

C - both the namenode and datanode

D - None of these

Q 17 - What mechanism does Hadoop use to make the namenode resilient to failure?

A - Take backup of filesystem metadata to a local disk and a remote NFS mount.

B - Store the filesystem metadata in cloud.

C - Use a machine with at least 12 CPUs

D - Using expensive and reliable hardware.

Q 18 - The main role of the secondary namenode is to

A - Copy the filesystem metadata from primary namenode.

B - Copy the filesystem metadata from NFS stored by primary namenode

C - Monitor if the primary namenode is up and running.

D - Periodically merge the namespace image with the edit log.

Q 19 - For the frequently accessed HDFS files the blocks are cached in

A - the memory of the datanode

B - in the memory of the namenode

C - Both A&B

D - In the memory of the client application which requested the access to these files.

Q 20 - User applications can instruct the namenode to cache the files by

A - adding cache file names to cache pool

B - adding cache config to cache pool

C - adding cache directive to cache pool

D - passing the file names as parameters to the cache pool

Q 21 - In Hadoop 2.x release HDFS federation means

A - Allowing namenodes to communicate with each other.

B - Allow a cluster to scale by adding more datanodes under one namenode.

C - Allow a cluster to scale by adding more namenodes.

D - Adding more physical memory to both namenode and datanode.

Q 22 - Under HDFS federation

A - Each namenode manages metadata of the entire filesystem.

B - Each namenode manages metadata of a portion of the filesystem.

C - Failure of one namenode causes loss of some metadata availability from the entire
filesystem.

D - Each datanode registers with each namenode.

Q 23 - The main goal of HDFS High availability is

A - Faster creation of the replicas of primary namenode.

B - To reduce the cycle time required to bring back a new primary namenode after existing
primary fails.

C - Prevent data loss due to failure of primary namenode.

D - Prevent the primary namenode from becoming a single point of failure.

Q 24 - As part of HDFS high availability, a pair of primary namenodes is configured.
What is true for them?

A - When a client request comes, one of them chosen at random serves the request.

B - One of them is active while the other one remains powered off.

C - Datanodes send block reports to only one of the namenodes.

D - The standby node takes periodic checkpoints of active namenode’s namespace.

Q 25 - Zookeeper ensures that

A - All the namenodes are actively serving the client requests

B - Only one namenode is actively serving the client requests

C - A failover is triggered when any of the datanode fails.

D - A failover can not be started by hadoop administrator.

Q 26 - Under Hadoop High Availability, Fencing means

A - Preventing a previously active namenode from starting to run again.

B - Preventing the start of a failover in the event of network failure with the active namenode.

C - Preventing the power down to the previously active namenode.

D - Preventing a previously active namenode from writing to the edit log.

Q 27 - Which of the following is not a fencing mechanism for a previously active namenode?

A - Disabling its network port via a remote management command.

B - Revoking its access to shared storage directory.

C - Formatting its disk drive.

D - STONITH

Q 28 - The property used to set the default filesystem for Hadoop in core-site.xml is

A - filesystem.default

B - fs.default

C - fs.defaultFS

D - hdfs.default

Q 29 - The default replication factor for HDFS file system in hadoop is

A - 1

B - 2

C - 3

D - 4

Q 30 - When running on a pseudo distributed mode the replication factor is set to

A - 2

B - 1

C - 0

D - 3

Q 31 - For an HDFS directory the replication factor (RF) is

A - same as the RF of the files in that directory

B - Zero

C - 3

D - Does not apply.

Q 32 - The following is not permitted on HDFS files

A - Deleting

B - Renaming

C - Moving

D - Executing.

ANSWER SHEET

Question Number Answer Key

1 C

2 A

3 D

4 B

5 B

6 C

7 C

8 B

9 C

10 D

11 B

12 B

13 A

14 B

15 A

16 C

17 A

18 D

19 A

20 C

21 C

22 B

23 B

24 D

25 B

26 D

27 C

28 C

29 C

30 B

31 D

32 D
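
The block-size and replication questions in this test (Q 9, Q 10 and Q 29) reduce to simple arithmetic, which the sketch below works through. It is illustrative only, assuming a 128 MB block size (the Hadoop 2.x default for dfs.blocksize) and a replication factor of 3. The function name hdfs_footprint is hypothetical and is not part of any Hadoop API.

```python
MB = 1024 * 1024


def hdfs_footprint(file_size, block_size=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    file_size and block_size are in bytes. The defaults are assumptions
    for illustration, not values read from a real cluster configuration.
    """
    # Integer ceiling division: a partially filled last block still counts.
    blocks = (file_size + block_size - 1) // block_size
    # The last block holds only the remaining bytes, so raw usage is
    # file_size * replication rather than blocks * block_size * replication.
    raw_bytes = file_size * replication
    return blocks, raw_bytes


if __name__ == "__main__":
    for size_mb in (10, 200):
        blocks, raw = hdfs_footprint(size_mb * MB, block_size=64 * MB)
        print(f"{size_mb} MB file: {blocks} block(s), {raw // MB} MB raw")
```

With a 64 MB block size, a 10 MB file fits in one block but consumes only 10 MB per replica, while a 200 MB file is split into four blocks that can be placed on different disks in the cluster.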


HADOOP MOCK TEST II

Q 1 - HDFS can be accessed over HTTP using

A - viewfs URI scheme

B - webhdfs URI scheme

C - wasb URI scheme

D - HDFS ftp

Q 2 - Which of the following is true about HDFS?

A - HDFS filesystem can be mounted on a local client’s Filesystem using NFS.

B - HDFS filesystem can never be mounted on a local client’s Filesystem.

C - You can edit an existing record in an HDFS file which is already mounted using NFS.

D - You cannot append to a HDFS file which is mounted using NFS.

Q 3 - The client reading the data from HDFS filesystem in Hadoop

A - gets the data from the namenode

B - gets the block location from the datanode

C - gets only the block locations from the namenode

D - gets both the data and block location from the namenode

Q 4 - Which scenario demands highest bandwidth for data transfer between nodes in
Hadoop?

A - Different nodes on the same rack

B - Nodes on different racks in the same data center.

C - Nodes in different data centers

D - Data on the same node.

Q 5 - The current block location of HDFS where data is being written to,

A - is visible to the client requesting for it.

B - Block locations are never visible to client requests.

C - May or may not be visible to the reader.

D - becomes visible only after the buffered data is committed.

Q 6 - Which of these is not a scheduler option available with YARN?

A - Optimal Scheduler

B - FIFO scheduler

C - Capacity scheduler

D - Fair scheduler

Q 7 - Which of the following is not a Hadoop operation mode?

A - Pseudo distributed mode

B - Globally distributed mode

C - Stand alone mode

D - Fully-Distributed mode

Q 8 - The difference between standalone and pseudo-distributed mode is

A - Stand alone cannot use map reduce

B - Stand alone has a single java process running in it.

C - Pseudo distributed mode does not use HDFS

D - Pseudo distributed mode needs two or more physical machines.

Q 9 - The Hadoop framework is written in

A - C++

B - Python

C - Java

D - GO

Q 10 - The hdfs command to create the copy of a file from a local system is

A - CopyFromLocal

B - copyfromlocal

C - CopyLocal

D - copyFromLocal

Q 11 - The hdfs command put is used to

A - Copy files from local file system to HDFS.

B - Copy files or directories from local file system to HDFS.

C - Copy files from HDFS to local filesystem.

D - Copy files or directories from HDFS to local filesystem.

Q 12 - Under-replication in HDFS means

A - No replication is happening in the data nodes.

B - Replication process is very slow in the data nodes.

C - The frequency of replication in data nodes is very low.

D - The number of replicated copies is less than as specified by the replication factor.

Q 13 - When the namenode finds that some blocks are over replicated, it

A - Stops the replication job in the entire hdfs file system.

B - Slows down the replication process for those blocks.

C - Deletes the extra blocks.

D - Leaves the extra blocks as they are.

Q 14 - Which of the following properties is configured in core-site.xml?

A - Replication factor

B - Directory names to store hdfs files.

C - Host and port where MapReduce task runs.

D - Java Environment variables.

Q 15 - Which of the following properties is configured in hdfs-site.xml?

A - Replication factor

B - Directory names to store hdfs files.

C - Host and port where MapReduce task runs.

D - Java Environment variables.

Q 16 - Which of the following properties is configured in mapred-site.xml?

A - Replication factor

B - Directory names to store hdfs files.

C - Host and port where MapReduce task runs.

D - Java Environment variables.

Q 17 - Which of the following properties is configured in hadoop-env.sh?

A - Replication factor

B - Directory names to store hdfs files

C - Host and port where MapReduce task runs

D - Java Environment variables.

Q 18 - The command to check if Hadoop is up and running is −

A - Jsp

B - Jps

C - Hadoop fs -test

D - None

Q 19 - The information mapping data blocks with their corresponding files is stored in

A - Data node

B - Job Tracker

C - Task Tracker

D - Namenode

Q 20 - The file in Namenode which stores the information mapping the data block
location with file name is −

A - dfsimage

B - nameimage

C - fsimage

D - image

Q 21 - The namenode knows that the datanode is active using a mechanism known as

A - heartbeats

B - datapulse

C - h-signal

D - Active-pulse

Q 22 - The nature of hardware for the namenode should be

A - Superior than commodity grade

B - Commodity grade

C - Does not matter

D - Just have more Ram than each of the data nodes

Q 23 - In Hadoop, Snappy and LZO are examples of

A - Mechanisms of file transport between data nodes

B - Mechanisms of data compression

C - Mechanisms of data Replication

D - Mechanisms of Data synchronization

Q 24 - Which of the following Apache systems deals with ingesting streaming data into
Hadoop?

A - Oozie

B - Kafka

C - Flume

D - Hive

Q 25 - The input split used in MapReduce indicates

A - The average size of the data blocks used as input for the program

B - The location details of where the first whole record in a block begins and the last whole
record in the block ends.

C - Splitting the input data to a MapReduce program into a size already configured in the
mapred-site.xml

D - None of these

Q 26 - The output of a mapper task is

A - The Key-value pair of all the records of the dataset.

B - The Key-value pair of all the records from the input split processed by the mapper

C - Only the sorted Keys from the input split

D - The number of rows processed by the mapper task.

Q 27 - The role of a Journal node is to

A - Report the location of the blocks in a data node

B - Report the edit log information of the blocks in the data node.

C - Report the Schedules when the jobs are going to run

D - Report the activity of various components handled by resource manager

Q 28 - The Zookeeper

A - Detects the failure of the namenode and elects a new namenode.

B - Detects the failure of datanodes and elects a new datanode.

C - Prevents the hardware from overheating by shutting them down.

D - Maintains a list of all the components IP address of the Hadoop cluster.

Q 29 - If the IP address or hostname of a datanode changes

A - The namenode updates the mapping between file name and block name

B - The namenode need not update mapping between file name and block name

C - The data in that data node is lost forever

D - The namenode has to be restarted

Q 30 - When a client contacts the namenode for accessing a file, the namenode
responds with

A - Size of the file requested.

B - Block ID of the file requested.

C - Block ID and hostname of any one of the data nodes containing that block.

D - Block ID and hostname of all the data nodes containing that block.

Q 31 - HDFS stands for

A - Highly distributed file system.

B - Hadoop directed file system

C - Highly distributed file shell

D - Hadoop distributed file system.

Q 32 - The Hadoop tool used for uniformly spreading the data across the data nodes is
named −

A - Scheduler

B - Balancer

C - Spreader

D - Reporter

Q 33 - In the secondary namenode the amount of memory needed is

A - Similar to that of primary node

B - Should be at least half of the primary node

C - Must be double of that of primary node

D - Depends only on the number of data nodes it is going to handle

ANSWER SHEET

Question Number Answer Key

1 B

2 A

3 C

4 C

5 D

6 A

7 B

8 B

9 C

10 D

11 B

12 D

13 C

14 B

15 A

16 C

17 D

18 B

19 D

20 C

21 A

22 A

23 B

24 C

25 B

26 B

27 B

28 A

29 B

30 D

31 D

32 B

33 A
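
Q 12 and Q 13 of this test describe how the namenode reacts when the number of live copies of a block differs from the replication factor. The sketch below restates that rule in Python. It is a simplified illustration, assuming a replication factor of 3 and hypothetical block IDs; the function names replication_status and plan_actions are not Hadoop APIs, and the real namenode also weighs rack placement when choosing which copies to add or delete.

```python
def replication_status(live_replicas, replication_factor=3):
    """Classify a block by comparing live copies with the target factor."""
    if live_replicas < replication_factor:
        return "under-replicated"
    if live_replicas > replication_factor:
        return "over-replicated"
    return "ok"


def plan_actions(block_replicas, replication_factor=3):
    """Map each unhealthy block to the copies to add (+) or delete (-)."""
    return {
        block: replication_factor - live
        for block, live in block_replicas.items()
        if live != replication_factor
    }


if __name__ == "__main__":
    # Hypothetical block IDs mapped to their current number of live copies.
    cluster = {"blk_1": 2, "blk_2": 3, "blk_3": 5}
    for block, live in cluster.items():
        print(block, replication_status(live))
    print(plan_actions(cluster))
```

A positive value means new replicas must be created, and a negative value means the extra replicas are deleted.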

HADOOP MOCK TEST III

Q 1 - The purpose of checkpoint node in a Hadoop cluster is to

A - Check if the namenode is active

B - Check if the fsimage file is in sync between namenode and secondary namenode

C - Merge the fsimage and edit log and upload the result back to the active namenode.

D - Check which data nodes are unreachable.

Q 2 - When a backup node is used in a cluster there is no need of

A - Check point node

B - Secondary name node

C - Secondary data node

D - Rack awareness

Q 3 - Rack awareness in name node means

A - It is aware how many racks are available in the cluster

B - It is aware of the mapping between the node and the rack

C - It is aware of the number of nodes in each of the rack

D - It is aware which data nodes are unavailable in the cluster.

Q 4 - When a machine is declared as a datanode, the disk space in it

A - Can be used only for HDFS storage

B - Can be used for both HDFS and non-HDFs storage

C - Cannot be accessed by non-hadoop commands

D - cannot store text files.

Q 5 - When a file in HDFS is deleted by a user

A - it is lost forever

B - It goes to trash if configured.

C - It becomes hidden from the user but stays in the file system

D - Files in HDFS cannot be deleted

Q 6 - The source of HDFS architecture in Hadoop originated as

A - Google distributed filesystem

B - Yahoo distributed filesystem

C - Facebook distributed filesystem

D - Azure distributed filesystem

Q 7 - The inter process communication between different nodes in Hadoop uses

A - REST API

B - RPC

C - RMI

D - IP Exchange

Q 8 - The type of data Hadoop can deal with is

A - Structured

B - Semi-structured

C - Unstructured

D - All of the above

Q 9 - YARN stands for

A - Yahoo’s another resource name

B - Yet another resource negotiator

C - Yahoo’s archived Resource names

D - Yet another resource need.

Q 10 - The fully distributed mode of installation (without virtualization) needs a minimum of

A - 2 Physical machines

B - 3 Physical machines

C - 4 Physical machines

D - 1 Physical machine

Q 11 - Running start-dfs.sh results in

A - Starting namenode and datanode

B - Starting namenode only

C - Starting datanode only

D - Starting namenode and resource manager

Q 12 - Which of the following is not a goal of HDFS?

A - Fault detection and recovery

B - Handle huge dataset

C - Prevent deletion of data

D - Provide high network bandwidth for data movement

Q 13 - The command "hadoop fs -test -z URI" gives the result 0 if

A - the path is a directory

B - the path is a file

C - the path is not empty

D - the file is zero length

Q 14 - In HDFS the files cannot be

A - read

B - deleted

C - executed

D - Archived

Q 15 - hadoop fs -expunge

A - Gives the list of datanodes

B - Used to delete a file

C - Used to exchange a file between two datanodes.

D - Empties the trash.

Q 16 - All the files in a directory in HDFS can be merged together using

A - getmerge

B - putmerge

C - remerge

D - mergeall

Q 17 - The replication factor of a file in HDFS can be changed using

A - changerep

B - rerep

C - setrep

D - xrep

Q 18 - The command used to copy a directory from one node to another in HDFS is

A - rcp

B - dcp

C - drcp

D - distcp

Q 19 - The archive file created in Hadoop always has the extension of

A - .hrc

B - .har

C - .hrh

D - .hrar

Q 20 - To unarchive an already archived file in Hadoop, use the command

A - unrar

B - unhar

C - cp

D - cphar

Q 21 - The data from a remote hadoop cluster can

A - not be read by another hadoop cluster

B - be read using http

C - be read using hhtp

D - be read using hftp

Q 22 - The purpose of starting namenode in the recovery mode is to

A - Recover a failed namenode

B - Recover a failed datanode

C - Recover data from one of the metadata storage locations

D - Recover data when there is only one metadata storage location

Q 23 - When you increase the number of files stored in HDFS, The memory required by
namenode

A - Increases

B - Decreases

C - Remains unchanged

D - May increase or decrease

Q 24 - If we increase the size of files stored in HDFS without increasing the number of
files, then the memory required by namenode

A - Decreases

B - Increases

C - Remains unchanged

D - May or may not increase

Q 25 - The current limiting factor to the size of a hadoop cluster is

A - Excess heat generated in data center

B - Upper limit of the network bandwidth

C - Upper limit of the RAM in namenode

D - 4000 data nodes

Q 26 - The decommission feature in hadoop is used for

A - Decommissioning the namenode

B - Decommissioning the data nodes

C - Decommissioning the secondary namenode.

D - Decommissioning the entire Hadoop cluster.

Q 27 - You can reserve the amount of disk usage in a data node by configuring
dfs.datanode.du.reserved in which of the following files?

A - Hdfs-site.xml

B - Hdfs-default.xml

C - Core-site.xml

D - Mapred-site.xml

Q 28 - The namenode loses its only copy of fsimage file. We can recover this from

A - Datanodes

B - Secondary namenode

C - Checkpoint node

D - Never

Q 29 - In a HDFS system with block size 64MB we store a file which is less than 64MB.
Which of the following is true?

A - The file will consume 64MB

B - The file will consume more than 64MB

C - The file will consume less than 64MB.

D - Can not be predicted.

Q 30 - A running job in hadoop can

A - Be killed with a command

B - Can never be killed with a command

C - Can be killed only by shutting down the name node

D - Be paused and run again

Q 31 - The number of tasks a task tracker can accept depends on

A - Maximum memory available in the node

B - Not limited

C - Number of slots configured in it

D - As decided by the jobTracker

Q 32 - When a jobTracker schedules a task, it first looks for

A - A node with empty slot in the same rack as datanode

B - Any node on the same rack as the datanode

C - Any node on the rack adjacent to rack of the datanode

D - Just any node in the cluster

Q 33 - The heartbeat signals are sent from

A - Jobtracker to Tasktracker

B - Tasktracker to Job tracker

C - Jobtracker to namenode

D - Tasktracker to namenode

ANSWER SHEET

Question Number Answer Key

1 C

2 A

3 B

4 B

5 B

6 A

7 B

8 D

9 B

10 A

11 A

12 C

13 D

14 C

15 D

16 A

17 C

18 D

19 B

20 C

21 D

22 D

23 A

24 A

25 C

26 B

27 A

28 C

29 C

30 A

31 C

32 A

33 B
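
Q 32 of this test concerns how a task is placed close to its data. The sketch below illustrates one possible selection rule in Python, assuming the classic preference order of a data-local node first, then a node on the same rack as a replica, then any node with a free slot. The function pick_node, the node names and the rack map are all hypothetical; this is not the JobTracker's actual implementation.

```python
def pick_node(replica_hosts, rack_of, free_slots):
    """Pick a node for a task, preferring nodes close to the data.

    replica_hosts: set of nodes that hold a replica of the input block.
    rack_of: dict mapping every node name to its rack name.
    free_slots: dict mapping every node name to its number of free slots.
    Returns a node name, or None when no node has a free slot.
    """
    candidates = [node for node, slots in free_slots.items() if slots > 0]
    if not candidates:
        return None
    # 1. Data-local: a node that already stores the block.
    for node in candidates:
        if node in replica_hosts:
            return node
    # 2. Rack-local: a node on the same rack as one of the replicas.
    replica_racks = {rack_of[host] for host in replica_hosts}
    for node in candidates:
        if rack_of[node] in replica_racks:
            return node
    # 3. Off-rack: any node that has a free slot.
    return candidates[0]


if __name__ == "__main__":
    racks = {"n1": "r1", "n2": "r1", "n3": "r2"}
    print(pick_node({"n1"}, racks, {"n1": 0, "n2": 1, "n3": 1}))
```

In the usage example the node holding the block (n1) has no free slot, so the task goes to n2, which shares rack r1 with it, rather than to n3 on another rack.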


HADOOP MOCK TEST IV

Q 1 - When a jobTracker schedules a task, it first looks for

A - A node with empty slot in the same rack as datanode

B - Any node on the same rack as the datanode

C - Any node on the rack adjacent to rack of the datanode

D - Just any node in the cluster

Q 2 - The heartbeat signals are sent from

A - Jobtracker to Tasktracker

B - Tasktracker to Job tracker

C - Jobtracker to namenode

D - Tasktracker to namenode

Q 3 - Job tracker runs on

A - Namenode

B - Datanode

C - Secondary namenode

D - Secondary datanode

Q 4 - Which of the following is not a scheduling option available in YARN

A - Balanced scheduler

B - Fair scheduler

C - Capacity scheduler

D - FIFO scheduler.

Q 5 - What is the default input format?

A - The default input format is xml. Developer can specify other input formats as appropriate if
xml is not the correct input.

B - There is no default input format. The input format always should be specified.

C - The default input format is a sequence file format. The data needs to be preprocessed before
using the default input format.

D - The default input format is TextInputFormat with byte offset as a key and entire line as a
value.

Q 6 - Which one of the following is not a big data feature?

A - Velocity

B - Veracity

C - Volume

D - Variety

Q 7 - Which technology is used to store data in Hadoop?

A - HBase

B - Avro

C - Sqoop

D - Zookeeper

Q 8 - Which technology is used to serialize the data in Hadoop?

A - HBase

B - Avro

C - Sqoop

D - Zookeeper

Q 9 - Which technology is used to import and export data in Hadoop?

A - HBase

B - Avro

C - Sqoop

D - Zookeeper
Q 10 - Which of the following technologies is a document store database?

A - HBase

B - Hive

C - Cassandra

D - CouchDB

Q 11 - Which one of the following is not true regarding Hadoop?

A - It is a distributed framework.

B - The main algorithm used in it is MapReduce

C - It runs on commodity hardware

D - All are true

Q 12 - Which one of the following stores data?

A - Name node

B - Data node

C - Master node

D - None of these

Q 13 - Which one of the following nodes manages other nodes?

A - Name node

B - Data node

C - Slave node

D - None of these

Q 14 - What is AVRO?

A - Avro is a Java serialization library.

B - Avro is a Java compression library.

C - Avro is a Java library that creates splittable files.

D - None of these answers are correct.

Q 15 - Can you run Map - Reduce jobs directly on Avro data?

A - Yes, Avro was specifically designed for data processing via Map-Reduce.

B - Yes, but additional extensive coding is required.

C - No, Avro was specifically designed for data storage only.

D - Avro specifies metadata that allows easier data access. This data cannot be used as part of
map-reduce execution, rather input specification only.

Q 16 - What is distributed cache?

A - The distributed cache is a special component on the name node that caches frequently used
data for faster client response. It is used during the reduce step.

B - The distributed cache is a special component on the data node that caches frequently used data
for faster client response. It is used during the map step.

C - The distributed cache is a component that caches Java objects.

D - The distributed cache is a component that allows developers to deploy JARs for MapReduce
processing.

Q 17 - What is Writable?

A - Writable is a Java interface that needs to be implemented for streaming data to remote
servers.

B - Writable is a Java interface that needs to be implemented for HDFS writes.

C - Writable is a Java interface that needs to be implemented for MapReduce processing.

D - None of these answers are correct.

Q 18 - What is HBase?

A - HBase is a separate set of Java APIs for a Hadoop cluster.

B - HBase is a part of the Apache Hadoop project that provides an interface for scanning large
amounts of data using the Hadoop infrastructure.

D - HBase is a part of the Apache Hadoop project that provides a SQL-like interface for data
processing.

Q 19 - How does Hadoop process large volumes of data?

A - Hadoop uses a lot of machines in parallel. This optimizes data processing.

B - Hadoop was specifically designed to process large amounts of data by taking advantage of
MPP hardware.

C - Hadoop ships the code to the data instead of sending the data to the code.

D - Hadoop uses sophisticated caching techniques on the name node to speed up processing of data.

Q 20 - When using HDFS, what occurs when a file is deleted from the command line?

A - It is permanently deleted if trash is enabled.

B - It is placed into a trash directory common to all users for that cluster.

C - It is permanently deleted and the file attributes are recorded in a log file.

D - It is moved into the trash directory of the user who deleted it if trash is enabled.
Q 21 - When archiving Hadoop files, which of the following statements are true?
(Choose two answers)

1. Archived files will display with the extension .arc.

2. Many small files will become fewer large files.

3. MapReduce processes the original files names even after files are archived.

4. Archived files must be unarchived for HDFS and MapReduce to access the
original, small files.

5. Archive is intended for files that need to be saved but no longer accessed by
HDFS.

A - 1 & 3

B - 2 & 3

C - 2 & 4

D - 3 & 4

Q 22 - When writing data to HDFS, what is true if the replication factor is three?
(Choose two answers)

1. Data is written to DataNodes on three separate racks (if rack-aware).

2. The Data is stored on each DataNode with a separate file which contains a
checksum value.

3. Data is written to blocks on three different DataNodes.

4. The Client is returned with a success upon the successful writing of the first
block and checksum check.

A - 1 & 3

B - 2 & 3

C - 3 & 4

D - 1 & 4

Q 23 - Which of the following are among the duties of the Data Nodes in HDFS?

A - Maintain the file system tree and metadata for all files and directories.

B - None of the options is correct.

C - Control the execution of an individual map task or a reduce task.

D - Store and retrieve blocks when told to by clients or the NameNode.

E - Manage the file system namespace.

Q 24 - Which of the following components retrieves the input splits directly from
HDFS to determine the number of map tasks?

A - The NameNode.

B - The TaskTrackers.
C - The JobClient.

D - The JobTracker.

E - None of the options is correct.

Q 25 - The org.apache.hadoop.io.Writable interface declares which two methods?

(Choose two answers)

1. public void readFields(DataInput).

2. public void read(DataInput).

3. public void writeFields(DataOutput).

4. public void write(DataOutput).

A - 1 & 4

B - 2 & 3

C - 3 & 4

D - 2 & 4

Q 26 - Which one of the following statements is true regarding <key,value> pairs of a
MapReduce job?

A - A key class must implement Writable.

B - A key class must implement WritableComparable.

C - A value class must implement WritableComparable.

D - A value class must extend WritableComparable.

Q 27 - Which one of the following statements is false regarding the Distributed Cache?

A - The Hadoop framework will ensure that any files in the Distributed Cache are distributed to all
map and reduce tasks.

B - The files in the cache can be text files, or they can be archive files like zip and JAR files.

C - Disk I/O is avoided because data in the cache is stored in memory.

D - The Hadoop framework will copy the files in the Distributed Cache on to the slave node
before any tasks for the job are executed on that node.

Q 28 - Which one of the following is not a main component of HBase?

A - Region Server.

B - Nagios.

C - ZooKeeper.

D - Master Server.

Q 29 - Which of the following is false about RawComparator ?

A - Compare the keys by byte.

B - Performance can be improved in the sort and shuffle phase by using RawComparator.

C - Intermediary keys are deserialized to perform a comparison.

Q 30 - Which daemon is responsible for replication of data in Hadoop?

A - HDFS.

B - Task Tracker.

C - Job Tracker.

D - Name Node.

E - Data Node.

Q 31 - Keys from the output of shuffle and sort implement which of the following
interfaces?

A - Writable.

B - WritableComparable.

C - Configurable.

D - ComparableWritable.

E - Comparable.

Q 32 - In order to apply a combiner, what is one property that has to be satisfied by
the values emitted from the mapper?

A - A combiner can always be applied to any data.

B - The output of the mapper and the output of the combiner must be the same type of key-value pair, and
they can be heterogeneous.

C - The output of the mapper and the output of the combiner must be the same type of key-value pair. A
combiner can be applied only if the values satisfy the associative and commutative properties.
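
To illustrate why the associative and commutative properties matter for a combiner, here is a minimal Python sketch (not Hadoop code). The helper names `combine` and `reduce_all` and the sample mapper outputs are hypothetical. It applies a function locally per mapper and then again at the reducer: with `sum` the result is unchanged, while with `mean` the partial means give a different answer than the true mean.

```python
from collections import defaultdict


def combine(pairs, fn):
    """Apply fn to the values seen for each key in one mapper's local output."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, fn(values)) for key, values in grouped.items()]


def reduce_all(pairs, fn):
    """Final reduce step over every pair that reaches the reducer."""
    return dict(combine(pairs, fn))


def mean(values):
    return sum(values) / len(values)


# Hypothetical output of two mappers for the same key.
mapper_1 = [("a", 1), ("a", 2), ("a", 3)]
mapper_2 = [("a", 10)]

# sum is associative and commutative: combining first does not change the result.
sum_direct = reduce_all(mapper_1 + mapper_2, sum)
sum_combined = reduce_all(combine(mapper_1, sum) + combine(mapper_2, sum), sum)

# mean is not: a mean of partial means differs from the mean of all values.
mean_direct = reduce_all(mapper_1 + mapper_2, mean)
mean_combined = reduce_all(combine(mapper_1, mean) + combine(mapper_2, mean), mean)

print(sum_direct, sum_combined, mean_direct, mean_combined)
```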

ANSWER SHEET

Question Number Answer Key

1 A

2 B

3 A

4 A

5 D

6 B

7 A
8 B

9 C

10 D

11 D

12 B

13 A

14 A

15 A

16 B

17 C

18 B

19 C

20 C

21 B

22 C

23 D

24 D

25 A

26 B

27 C

28 B

29 C

30 D

31 B

32 C

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. -------- function is used to add a title to each Axes instance in a figure.

A : set_title()

B : get_title()

C : set_label()

D : title()

Q.no 2. ---------- provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python.

A : Pandas

B : Numpy

C : Scikit-Learn

D : image
Q.no 3. The ---------- attribute specifies the number of dimensions or axes of the
array.

A : ndarray.size

B : ndarray.dtype

C : ndarray.ndim

D : ndarray.axes

Q.no 4. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 5. ---------------- submodule of scipy is dedicated to image processing.

A : ndarray

B : spatial

C : ndimage

D : special

Q.no 6. If the number of input features is 3, then the optimal hyperplane in a support
vector machine is a -------------

A : Single point

B : Line

C : 2-D Plane

D : Non linear line

Q.no 7. --------------- is an example of human generated unstructured data.

A : Text files

B : Satellite data

C : Sensor data
D : Seismic imagery data

Q.no 8. -------------- must be installed before you use scikit-learn.

A : Matlab

B : Scilab

C : Scipy

D : Numpy

Q.no 9. The procedure to organize items of a given collection into groups based on
some similar features is called -------------

A : Regression

B : Clustering

C : Decision Trees

D : Association

Q.no 10. In statistics, a population consists of -------------------

A : All people living in a country.

B : All people living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 11. Which function is used to give a title to the axes?

A : plt.title()

B : plt.xlabel()

C : plt.ylabel()

D : plt.xscale()

Q.no 12. ------------- function is used to plot a histogram using the matplotlib library.

A : hist()

B : bar()

C : pie()
D : scatter()

Q.no 13. Which of the following is a measure used in decision trees when selecting the
splitting criterion that partitions the data in the best possible manner?

A : Probability

B : Gini Index

C : Regression

D : Association
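
As an illustration of a split-selection measure, the sketch below computes the Gini index for a hypothetical set of class labels and for two candidate binary splits. The function names `gini` and `weighted_gini` and the sample labels are assumptions made for this example; they are not taken from any library. A lower weighted value indicates a purer split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a binary split, weighted by the size of each side."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


# Hypothetical class labels at a node and two ways of splitting them.
parent = ["yes", "yes", "no", "no"]
pure_split = weighted_gini(["yes", "yes"], ["no", "no"])
mixed_split = weighted_gini(["yes", "no"], ["yes", "no"])

print(gini(parent), pure_split, mixed_split)
```

Under this measure a tree builder would prefer the first split, since both of its sides contain a single class.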

Q.no 14. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 15. Which of the following is not a type of clustering algorithm?

A : Density clustering

B : K-Means clustering

C : Centroid clustering

D : Simple clustering

Q.no 16. ------ answers questions like "How can we make it happen?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 17. -------------- data does not fit into a data model due to variations in content.

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 18. ---------------- function multiplies two matrices in NumPy.

A : prod()

B : mult()

C : dot()

D : *

Q.no 19. -------------------- is a general-purpose array-processing package that provides a
high-performance multidimensional array object and tools for working with
these arrays.

A : NumPy

B : SciPy

C : sklearn

D : None of these

Q.no 20. -------- library is built on top of NumPy, SciPy and Matplotlib.

A : Sympy

B : Scikit

C : Pandas

D : Numpy

Q.no 21. The last element of an ndarray is indexed by -------------

A : 0

B : -1

C : 1

D : -2
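
A short sketch of negative indexing, using a plain Python list as a stand-in (the sample values are hypothetical). One-dimensional NumPy arrays follow the same indexing convention.

```python
values = [10, 20, 30, 40]

last = values[-1]                        # counts back from the end
second_last = values[-2]
same_as_last = values[len(values) - 1]   # equivalent positive index

print(last, second_last, same_as_last)
```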

Q.no 22. ------------ is the step performed by a data scientist after acquiring the data.

A : Data Cleansing

B : Data Integration
C : Data Replication

D : Data loading

Q.no 23. ------------- function is used to save an array as an image file.

A : matplotlib.pyplot.image()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 24. ------------- is an unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 25. What is the correct syntax to generate integers from 10 to 30?

A : x=numpy.arange(10,30)

B : x=numpy.array(10,30)

C : x=numpy.arange(10,31)

D : x=arange(10,31)
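
The half-open range convention behind this question can be sketched with Python's built-in `range`, used here as a stand-in; `numpy.arange` follows the same rule for integer arguments, so the stop value is excluded.

```python
half_open = list(range(10, 30))   # stops before 30
inclusive = list(range(10, 31))   # stop of 31 is needed to include 30

print(half_open[-1], inclusive[-1], len(inclusive))
```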

Q.no 26. ---------- function is used to get the element-wise remainder of division of arrays.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 27. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support
C : Lift

D : None of These

Q.no 28. A ------------ is a supervised machine learning algorithm which relies on the
assumption of feature independence to classify input data.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 29. What is the use of the following function? plt.xlabel("Total Marks")

A : Gives a label to the X-axis

B : Gives a label to the Y-axis

C : Gives a title to the figure

D : Adds text to the figure

Q.no 30. Pandas provides the ----------- function as the entry point for all standard
database join operations while merging two DataFrame objects.

A : concat()

B : replace()

C : merge()

D : add()

Q.no 31. Data generated on Twitter is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 32. ------------------ is an excellent 2D and 3D graphics library for generating
scientific figures.

A : Pandas
B : Numpy

C : matplotlib

D : ndarray

Q.no 33. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))
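
A minimal sketch of the support ratio, together with confidence, for association rule mining. The function names and the small list of transactions are hypothetical and chosen only for illustration. Support is the fraction of transactions that contain an itemset; confidence of a rule X -> Y is support(X and Y together) divided by support(X).

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """How often the rule antecedent -> consequent holds when the antecedent occurs."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

support_bread = support(transactions, {"bread"})              # 3 of 4 transactions
conf_bread_milk = confidence(transactions, {"bread"}, {"milk"})

print(support_bread, conf_bread_milk)
```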

Q.no 34. ------------ is an example of semi structured data

A : NoSQL data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 35. --------------------- is a raster graphics format with lossless compression.

A : EPS

B : PDF

C : PNG

D : PS

Q.no 36. ------------------ is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and
leaf nodes represent classes or class distributions.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 37. --------------------- is a form of supervised learning algorithm which is used by
mail service providers like Gmail, Yahoo, etc. to classify a new mail as spam or
not spam.

A : Classification

B : Regression

C : Clustering

D : Naïve Bayes

Q.no 38. In a ------------, the x-axis values are grouped into bins and each bin is treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 39. When data are collected in a statistical study for only a portion or subset
of all elements of interest, we are using a

A : Sample

B : Parameter

C : Population

D : Probability

Q.no 40. ------------- regression finds a relationship between one or more features
(independent variables) and a continuous variable (dependent variable).

A : Non-linear

B : Linear

C : Both of these

D : None of These

Q.no 41. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : Lift
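
A small sketch of entropy as a measure of uncertainty in a set of class labels, using the usual Shannon formula with base-2 logarithms. The function name `entropy` and the sample labels are assumptions for this example. A set with a single class has zero entropy, and an evenly mixed two-class set has entropy 1.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions p."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


# Hypothetical label sets.
pure = ["spam", "spam", "spam", "spam"]
mixed = ["spam", "spam", "ham", "ham"]

print(entropy(pure), entropy(mixed))
```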

Q.no 42. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 43. --------- is a technique that duplicates a smaller array so that its dimensionality
and size match those of a larger array.

A : Multiplication

B : Broadcasting

C : Addition

D : Flatten

Q.no 44. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff recruitment

Q.no 45. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : save_fig()

C : Figure()

D : save_image()

Q.no 46. ---------- is a machine learning algorithm used in cross-marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree
B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 47. Which function returns an ndarray object that contains numbers that
are evenly spaced on a log scale?

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()

Q.no 48. The --------- argument of the merge function, used while merging two dataframes,
specifies which keys are to be included in the resulting dataframe.

A : right

B : on

C : sort

D : how

Q.no 49. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 50. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()
Q.no 51. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 52. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 53. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 54. In a data science project, the data acquisition step involves ----------------

A : Acquiring data from various sources.

B : Selecting dataset

C : Data preprocessing

D : Data modeling

Q.no 55. Select the correct statement:

A : Raw data is the original source of data.

B : Preprocessed data is the original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is the original source of data.

Q.no 56. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 57. Catalog design is a complex process where the items in a
business's catalog are often selected to complement each other so that buying
one item will lead to buying another. So these items are often complements or
very related. Which algorith

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 58. While plotting using matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)
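
The subplot numbering used in these questions can be sketched without matplotlib. The helpers `decode_subplot` and `grid_position` below are hypothetical names for this example, not matplotlib functions. They model two conventions: a three-digit argument such as 234 is shorthand for (nrows, ncols, index), and the index counts cells row by row starting at 1.

```python
def decode_subplot(code):
    """Split a three-digit shorthand such as 234 into (nrows, ncols, index)."""
    nrows, rest = divmod(code, 100)
    ncols, index = divmod(rest, 10)
    return nrows, ncols, index


def grid_position(nrows, ncols, index):
    """Map a 1-based, row-major subplot index to a 0-based (row, column) pair."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    return divmod(index - 1, ncols)


print(decode_subplot(234))        # same grid and cell as subplot(2, 3, 4)
print(grid_position(2, 3, 3))     # row 0, last column: the top right cell
print(grid_position(2, 3, 4))     # first cell of the second row
```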

Q.no 59. The ------------ algorithm models a series of logical If-Then-Else decision
statements; there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes
Q.no 60. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top to bottom

B : Bottom to top

C : Left to right

D : Right to left
Answer for Question No 1. is a

Answer for Question No 2. is c

Answer for Question No 3. is c

Answer for Question No 4. is d

Answer for Question No 5. is c

Answer for Question No 6. is c

Answer for Question No 7. is a

Answer for Question No 8. is c

Answer for Question No 9. is b

Answer for Question No 10. is c

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is b

Answer for Question No 15. is d

Answer for Question No 16. is b

Answer for Question No 17. is b

Answer for Question No 18. is c

Answer for Question No 19. is a

Answer for Question No 20. is b

Answer for Question No 21. is b

Answer for Question No 22. is a

Answer for Question No 23. is d

Answer for Question No 24. is d

Answer for Question No 25. is c

Answer for Question No 26. is b

Answer for Question No 27. is a

Answer for Question No 28. is c

Answer for Question No 29. is a

Answer for Question No 30. is c

Answer for Question No 31. is b

Answer for Question No 32. is c

Answer for Question No 33. is a

Answer for Question No 34. is a

Answer for Question No 35. is c

Answer for Question No 36. is a

Answer for Question No 37. is a

Answer for Question No 38. is d

Answer for Question No 39. is a

Answer for Question No 40. is b

Answer for Question No 41. is a

Answer for Question No 42. is d

Answer for Question No 43. is b

Answer for Question No 44. is d

Answer for Question No 45. is b

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is d

Answer for Question No 49. is d

Answer for Question No 50. is c

Answer for Question No 51. is a

Answer for Question No 52. is a

Answer for Question No 53. is b

Answer for Question No 54. is a

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is a

Answer for Question No 59. is b

Answer for Question No 60. is a

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 2. ----------- is data that depends on a data model and resides in a fixed field within
a record.

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered
Q.no 3. A ---------- plot displays information as a series of data points connected by
straight lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 4. ---------------- is about developing code to enable the machine to learn to
perform tasks, and its basic principle is the automatic modeling of the underlying processes that
have generated the collected data.

A : Data Science

B : Data Analytics

C : Data Warehousing

D : Data mining

Q.no 5. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 6. ---------------- method of a dataframe reads the first n rows of the dataframe.

A : head(n)

B : tail(n)

C : first(n)

D : start(n)

Q.no 7. NumPy supports this function to find the trigonometric sine element-wise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()
D : numpy.rad2sin(x1)

Q.no 8. The Apriori algorithm is a(n) --------------- machine learning algorithm.

A : Unsupervised

B : Supervised

C : Both of these

D : None of These

Q.no 9. Which Python library is used for implementing machine learning
algorithms?

A : Scikit-Learn

B : Pandas

C : Matplotlib

D : Numpy

Q.no 10. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 11. Which of the following is not a raster image file format?

A : PNG

B : JPG

C : BMP

D : PDF

Q.no 12. The K-nearest neighbors algorithm is based on -------------- learning.

A : Unsupervised

B : Supervised
C : Association

D : Correlation

Q.no 13. ----------------- is an example of human generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 14. Which of the following is NOT supervised learning?

A : PCA

B : Decision Tree

C : Linear Regression

D : Naive Bayesian

Q.no 15. ----------- is a supervised machine learning algorithm that outputs an optimal
hyperplane for given labeled training data.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 16. ------------ rule mining is a technique to identify underlying relations
between different items.

A : Classification

B : Regression

C : Clustering

D : Association

Q.no 17. ------------- analytics describes what happened in the past.

A : Descriptive
B : Prescriptive

C : Predictive

D : Probability

Q.no 18. -------- function is used to add a title to each Axes instance in a figure.

A : set_title()

B : get_title()

C : set_label()

D : title()

Q.no 19. Which function is used to give a title to the axes?

A : plt.title()

B : plt.xlabel()

C : plt.ylabel()

D : plt.xscale()

Q.no 20. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 21. In a ------------, the x-axis values are grouped into bins and each bin is treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 22. ------- is the basic data structure of pandas and can be thought of as a SQL table or a
spreadsheet data representation.
A : DataFrame

B : Series

C : list

D : ndarray

Q.no 23. The ------------------ module of matplotlib is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 24. A perfect negative correlation is signified by -------------

A : 1

B : -1

C : 0

D : 2
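
As a worked illustration of the correlation coefficient, the sketch below computes Pearson's r from its definition for two hypothetical data series. The function name `pearson` and the data are assumptions for this example. When one series falls by a fixed amount every time the other rises by a fixed amount, r comes out at the lower bound of its range.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: one perfectly decreasing and one perfectly increasing relation.
xs = [1, 2, 3, 4, 5]
r_negative = pearson(xs, [10, 8, 6, 4, 2])
r_positive = pearson(xs, [2, 4, 6, 8, 10])

print(r_negative, r_positive)
```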

Q.no 25. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These

Q.no 26. In the matplotlib library, the ------------- module supports basic image loading,
rescaling and display operations.

A : picture

B : image

C : pyplot

D : sympy
Q.no 27. The --------- function from the matplotlib.pyplot library plots a bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 28. ---------- is an unsupervised technique aiming to divide a multivariate dataset
into clusters or groups.

A : KNN

B : Support Vector Machines

C : Regression

D : Cluster analysis

Q.no 29. When data are collected in a statistical study for only a portion or subset
of all elements of interest, we are using a

A : Sample

B : Parameter

C : Population

D : Probability

Q.no 30. -------- is the most important language for data science.

A : Java

B : Ruby

C : R

D : None of these

Q.no 31. The last element of an ndarray is indexed by -------------

A : 0

B : -1

C : 1
D : -2

Q.no 32. The number of iterations in Apriori ---------------

A : increases with the size of the data

B : decreases with the increase in size of the data

C : increases with the size of the maximum frequent set

D : decreases with increase in size of the maximum frequent set
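
To show how the number of Apriori passes relates to the data, here is a minimal level-wise sketch in Python. It is a simplified illustration under stated assumptions (a plain list scan instead of a hash tree, an absolute `min_count` threshold, hypothetical transactions), not a production implementation. Each loop iteration is one pass over the data for candidates of size k.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return (frequent itemsets grouped by size, number of level-wise passes)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    passes = 0
    k = 1
    while candidates:
        passes += 1
        # One pass over the data counts every candidate of size k.
        level = [c for c in candidates
                 if sum(1 for t in transactions if c <= t) >= min_count]
        if not level:
            break
        frequent[k] = level
        level_set = set(level)
        # Join step: merge frequent k-itemsets into candidates of size k + 1.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (the prior knowledge): every k-subset must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level_set for s in combinations(c, k))]
        k += 1
    return frequent, passes


# Two hypothetical datasets with the same number of transactions.
small_sets = [{"a", "b"}, {"a", "b"}, {"c"}]
large_sets = [{"a", "b", "c"}, {"a", "b", "c"}, {"d"}]

freq_small, passes_small = apriori(small_sets, min_count=2)
freq_large, passes_large = apriori(large_sets, min_count=2)

print(max(freq_small), passes_small)
print(max(freq_large), passes_large)
```

Both datasets contain three transactions, yet the second needs one more pass because its largest frequent itemset has three items rather than two.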

Q.no 33. Which of the following is used as attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 34. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : Software testing

Q.no 35. What is the correct syntax to generate integers from 10 to 30?

A : x=numpy.arange(10,30)

B : x=numpy.array(10,30)

C : x=numpy.arange(10,31)

D : x=arange(10,31)

Q.no 36. ------------- is an unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees
D : Cluster analysis

Q.no 37. --------------- searches for the linear optimal separating hyperplane for
separation of the data using essential training tuples called support vectors

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 38. ------------------- is a one-dimensional array defined in pandas that can be
used to store any data type.

A : Dict

B : Series

C : ndarray

D : list

Q.no 39. To read an image from a file into an array, the --------------- function is used.

A : matplotlib.pyplot.imshow()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 40. JSON file data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 41. In regression, the independent variable is also called the ------------

A : Regressor

B : Continuous
C : Regressand

D : Estimated

Q.no 42. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 43. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top to bottom

B : Bottom to top

C : Left to right

D : Right to left

Q.no 44. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff recruitment

Q.no 45. Determining the basic salary of an employee when their qualification is given is
a ----------- problem.

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 46. Which NumPy function is used to return the truncated value of the
input, element-wise?
A : round()

B : trunc()

C : del()

D : remove_decimal()

Q.no 47. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 48. While plotting using matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

Q.no 49. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 50. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : Lift
Q.no 51. The strength (degree) of the correlation between a set of independent
variables X and a dependent variable Y is measured by-------------

A : Coefficient of Correlation

B : Coefficient of Determination

C : Standard error of estimate

D : Probability

Q.no 52. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : save_fig()

C : Figure()

D : save_image()

Q.no 53. When an increase or decrease in one variable has no impact on the
other variable, there is ------------

A : Perfect correlation

B : No Correlation

C : Positive Correlation

D : Negative Correlation

Q.no 54. In matplotlib, -------------- is the container class for a figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 55. The plot_number parameter of the subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows
D : ncols

Q.no 56. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 57. The ---------- machine learning algorithm is used in cross-marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 58. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 59. In a dataframe, the ---------- function is used to compute summary statistics such as
count, mean, standard deviation, min and max for each numerical column.

A : display()

B : head()

C : describe()

D : sort()
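As an illustration of the summary-statistics question above, the sketch below calls `describe()` on a small dataframe. The column names and values are made up.

```python
import pandas as pd

# Small illustrative dataframe: one text column and two numeric columns.
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "marks": [40, 50, 60, 70],
    "age": [20, 21, 22, 23],
})

# By default describe() summarises only the numeric columns.
summary = df.describe()
print(summary)
```

The result is itself a dataframe whose rows are the statistics (count, mean, std, min, quartiles, max), so a single value can be read with `summary.loc["mean", "marks"]`.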

Q.no 60. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities
A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability
Answer for Question No 1. is c

Answer for Question No 2. is a

Answer for Question No 3. is b

Answer for Question No 4. is b

Answer for Question No 5. is a

Answer for Question No 6. is a

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is a

Answer for Question No 10. is d

Answer for Question No 11. is d

Answer for Question No 12. is b

Answer for Question No 13. is a

Answer for Question No 14. is a

Answer for Question No 15. is b

Answer for Question No 16. is d

Answer for Question No 17. is a

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is c

Answer for Question No 25. is a

Answer for Question No 26. is b

Answer for Question No 27. is c

Answer for Question No 28. is d

Answer for Question No 29. is a

Answer for Question No 30. is c

Answer for Question No 31. is b

Answer for Question No 32. is c

Answer for Question No 33. is a

Answer for Question No 34. is d

Answer for Question No 35. is c

Answer for Question No 36. is d

Answer for Question No 37. is d

Answer for Question No 38. is b

Answer for Question No 39. is b

Answer for Question No 40. is c

Answer for Question No 41. is a

Answer for Question No 42. is c

Answer for Question No 43. is a

Answer for Question No 44. is d

Answer for Question No 45. is b

Answer for Question No 46. is b

Answer for Question No 47. is b

Answer for Question No 48. is a

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is b

Answer for Question No 53. is b

Answer for Question No 54. is d

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is b

Answer for Question No 59. is c

Answer for Question No 60. is a

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 2. ------------ means the part of the population chosen for participation in the study

A : Population

B : Sample

C : Association

D : Correlation
Q.no 3. Choose the correct option for machine-generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Sensor data

Q.no 4. To save or write dataframe data into a CSV file, the -------- function is used

A : write_csv()

B : write_file()

C : csv_read()

D : to_csv()
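A minimal sketch of writing a dataframe as CSV follows. It writes to an in-memory buffer rather than a real file so that it has no side effects; the column names are illustrative, and passing a file path instead of the buffer would write to disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"roll_no": [1, 2], "marks": [78, 91]})

# Write the dataframe as CSV text; index=False leaves out the row index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the same text back with read_csv() recovers the original data.
buffer.seek(0)
restored = pd.read_csv(buffer)
```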

Q.no 5. ------------ uses a tree structure to specify sequences of decisions and
consequences.

A : Regression

B : Decision trees

C : KNN

D : SVM

Q.no 6. ---------------- is about developing code to enable the machine to learn to
perform tasks, and its basic principle is the automatic modeling of the underlying processes that
have generated the collected data.

A : Data Science

B : Data Analytics

C : Data Warehousing

D : Data mining

Q.no 7. Numpy supports this function to find the trigonometric sine elementwise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()
D : numpy.rad2sin(x1)

Q.no 8. ------------- type of analytics describes what happened in the past

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 9. The ----------- algorithm is based on the fact that it uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori
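The "prior knowledge" idea can be sketched as a level-wise (breadth-first) search in which a candidate itemset is kept only if all of its smaller subsets were already found frequent. The version below is a simplified illustration that counts support with plain loops instead of a hash tree; the function name, the baskets and the threshold are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: the candidates are the single items.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any candidate
        # that has an infrequent k-subset (the prior-knowledge step).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
result = apriori(baskets, min_support=2)
for itemset, count in result.items():
    print(sorted(itemset), count)
```

Here the three-item candidate is never counted because one of its pairs (milk, butter) was already known to be infrequent.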

Q.no 10. A satellite image is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 11. Unsupervised learning makes sense of ------------- data without having any
predefined dataset for its training.

A : unlabeled

B : labeled

C : semi-labeled

D : Empty dataset

Q.no 12. Correlation coefficient values lie between ----- and ---

A : -1 and +1

B : -1 and 0

C : 0 and 1

D : 0 and infinity

Q.no 13. K- nearest neighbors algorithm is based on -------------- learning

A : Un- Supervised

B : Supervised

C : Association

D : correlation

Q.no 14. ------ answers the questions like " How can we make it happen?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 15. ------------ type of plot shows all individual data points without connecting
them with lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 16. A ------------ chart is a circular plot divided into slices to show numerical
proportions.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 17. Which of the following measures is used in decision trees to select the
splitting criterion that partitions the data in the best possible manner?

A : Information Gain
B : Probability

C : Regression

D : Association

Q.no 18. ----------------- is an example of human generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 19. -------------- charts represent categorical data with rectangular bars

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 20. In correlation both values are always---------------

A : Random

B : sequential

C : Same

D : from same group

Q.no 21. To rotate an image -------- function is used from scipy library.

A : rotation()

B : scipy.move()

C : scipy.ndimage.rotate()

D : scipy.flip()

Q.no 22. A ---------- is an example of the most widely used machine learning
algorithms; much of its popularity is because it can be adapted to almost any type
of data.
A : Clustering

B : Regression

C : Decision trees

D : Apriori

Q.no 23. ------ is a classification technique that relies on the naïve assumption that
input variables are independent of each other.

A : KNN

B : Naïve Bayes

C : Regression

D : Support vector machine

Q.no 24. ----------- phase of the data analytics lifecycle usually takes the longest
time.

A : Data Preparation

B : Model Planning

C : Model Building

D : Communicate Results

Q.no 25. ------------------ is an excellent 2D and 3D graphics library for generating
scientific figures.

A : Pandas

B : Numpy

C : matplotlib

D : ndarray

Q.no 26. -------- is most important language for Data Science.

A : Java

B : Ruby

C : R

D : None of these
Q.no 27. Which statement will create a 5 x 5 array filled with all values 1?

A : x=numpy.ones((5,5))

B : x=numpy.ones(5)

C : x=numpy.zeros((5,5))

D : x=numpy.eye((5,5))

Q.no 28. Which function returns the identity array with n x n dimension with its
main diagonal set to ones and all other elements to zero.

A : numpy.ones()

B : numpy.zeros()

C : numpy.fill()

D : numpy.identity()
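The array-creation functions named in the two questions above can be compared side by side. This is a small sketch with illustrative variable names.

```python
import numpy as np

ones_5x5 = np.ones((5, 5))     # 5 x 5 array filled with 1.0 (shape passed as a tuple)
ones_1d = np.ones(5)           # shape (5,): a 1-D array, not a 5 x 5 grid
zeros_5x5 = np.zeros((5, 5))   # 5 x 5 array filled with 0.0
identity_3 = np.identity(3)    # 3 x 3 array, ones on the main diagonal, zeros elsewhere
eye_3 = np.eye(3)              # same result for the square, main-diagonal case

print(ones_5x5.shape, ones_1d.shape)
print(identity_3)
```

`numpy.eye()` additionally accepts a column count and a diagonal offset, whereas `numpy.identity()` always returns a square array with the main diagonal set.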

Q.no 29. From matplotlib, the ------------------ module is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 30. In this type of clustering, each data point either belongs to a cluster
completely or not.

A : Hard clustering

B : Soft Clustering

C : Medium clustering

D : Simple clustering

Q.no 31. ---------- function is used to add two numpy arrays elementwise.

A : numpy.add(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)
D : numpy.addition(x1,x2)

Q.no 32. A -----------------graph is a circular plot, divided into slices to show numerical
proportions.

A : Bar

B : Scatter

C : pie

D : line

Q.no 33. --------- function from matplotlib.pyplot library plots bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 34. If a=np.array([1,2,3,4,5,6,7,8,9,10]) then a[2:5:1] will produce output ----------

A : 3, 4, 5

B : 3,4,5,6

C : 2,3,4,5

D : 1,2,3,4,5
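The expected answer (3, 4, 5) corresponds to slice notation, so the sketch below assumes the intended expression is `a[2:5:1]`; written with commas, `a[2,5,1]` would raise an IndexError on a one-dimensional array.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# start=2, stop=5 (exclusive), step=1 -> the elements at indices 2, 3 and 4.
sliced = a[2:5:1]
print(sliced)  # [3 4 5]

# Indices start at 0, so index 2 holds the value 3.
third_value = a[2]
```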

Q.no 35. Identify the machine generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 36. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning
C : Data Visualization

D : software tester

Q.no 37. --------------------- is a raster graphics format with lossless compression.

A : EPS

B : PDF

C : PNG

D : PS

Q.no 38. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 39. Regression analysis -----------

A : Establishes a relationship between two variables

B : Establishes cause and effect

C : Measures growth

D : Measures demand for good

Q.no 40. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These
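The three association-rule measures listed above can be written out from their usual definitions. The following plain-Python sketch uses made-up baskets and hypothetical helper names.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the rule antecedent -> consequent holds when the antecedent occurs."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence of the rule relative to the baseline support of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets also have milk
l = lift(baskets, {"bread"}, {"milk"})
print(s, c, l)
```

Support describes how frequently an itemset appears, while confidence describes how often the rule turns out to be true.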

Q.no 41. The --------- argument of the merge function, used while merging two dataframes,
specifies which keys are to be included in the resulting dataframe.

A : right
B : on

C : sort

D : how
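To show the effect of the `how` argument, the sketch below merges two small dataframes on a shared key with three different settings. The table contents are invented for illustration.

```python
import pandas as pd

students = pd.DataFrame({"roll_no": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
marks = pd.DataFrame({"roll_no": [2, 3, 4], "marks": [67, 88, 54]})

# how="inner": only keys present in both dataframes.
inner = pd.merge(students, marks, on="roll_no", how="inner")

# how="left": every key from the left dataframe; missing marks become NaN.
left = pd.merge(students, marks, on="roll_no", how="left")

# how="outer": the union of keys from both dataframes.
outer = pd.merge(students, marks, on="roll_no", how="outer")

print(inner)
print(left)
print(outer)
```

The `on` argument names the column to join on, while `how` decides which keys survive the join.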

Q.no 42. Which of the following tasks is not performed by a Data Scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff Recruitment

Q.no 43. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 44. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 45. When plotting with matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

Q.no 46. Which of the following statements will create an axes at the top right
corner of the current figure?
A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 47. --------- function performs custom operations on the entire dataframe.

A : function()

B : subroutine()

C : routine()

D : pipe()
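A short sketch of table-wise function application follows. The helper `add_bonus` is a hypothetical custom operation written for this example; `pipe()` passes the whole dataframe to it as the first argument.

```python
import pandas as pd

def add_bonus(frame, bonus):
    """Hypothetical custom operation: return a copy with bonus added to every value."""
    return frame + bonus

df = pd.DataFrame({"test1": [10, 20], "test2": [30, 40]})

# pipe() calls add_bonus(df, 5) and returns whatever the function returns.
result = df.pipe(add_bonus, 5)
print(result)
```

Because each `pipe()` call returns a dataframe, several custom steps can be chained one after another.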

Q.no 48. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : lift

Q.no 49. Which of the following algorithms is used in economics, finance, biology,
etc., to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 50. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4
Q.no 51. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 52. --------------- is basically extracting particular set of elements from an array.

A : Slicing

B : indexing

C : sorting

D : broadcasting

Q.no 53. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 54. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 55. Plot_number parameter from subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows
D : ncols

Q.no 56. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 58. In unsupervised learning, scikit learn uses ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 59. In a dataframe, the ---------- function is used to compute summary statistics such as
count, mean, standard deviation, min and max for each numerical column.

A : display()

B : head()

C : describe()

D : sort()
Q.no 60. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()
Answer for Question No 1. is a

Answer for Question No 2. is b

Answer for Question No 3. is d

Answer for Question No 4. is d

Answer for Question No 5. is b

Answer for Question No 6. is b

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is d

Answer for Question No 10. is b

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is b

Answer for Question No 15. is c

Answer for Question No 16. is d

Answer for Question No 17. is a

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is c

Answer for Question No 22. is c

Answer for Question No 23. is b

Answer for Question No 24. is a

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is a

Answer for Question No 28. is d

Answer for Question No 29. is b

Answer for Question No 30. is a

Answer for Question No 31. is a

Answer for Question No 32. is c

Answer for Question No 33. is c

Answer for Question No 34. is a

Answer for Question No 35. is d

Answer for Question No 36. is d

Answer for Question No 37. is c

Answer for Question No 38. is d

Answer for Question No 39. is a

Answer for Question No 40. is a

Answer for Question No 41. is d

Answer for Question No 42. is d

Answer for Question No 43. is a

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is a

Answer for Question No 47. is d

Answer for Question No 48. is a

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is a

Answer for Question No 53. is c

Answer for Question No 54. is c

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is b

Answer for Question No 59. is c

Answer for Question No 60. is d

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. Apriori algorithm is --------------- machine learning algorithm.

A : Un- Supervised

B : Supervised

C : Both of these

D : None of These

Q.no 2. CCTV footage is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 3. Choose the correct option for machine-generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Sensor data

Q.no 4. Pin code of a city is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 5. The leaf nodes in decision trees return the ---------

A : decision condition

B : class labels

C : decision on variables

D : test score

Q.no 6. ---------- provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python

A : Pandas

B : Numpy

C : Scikit-Learn

D : image

Q.no 7. To import data from an Excel file into a dataframe, the ---------- function is
provided by the pandas package.

A : read_csv()

B : read_file()

C : read()

D : read_excel()
Q.no 8. ---------- function is used to get the positive square root of a numpy array
elementwise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.find(x1,2)

Q.no 9. ------------- function reads an image from a file as an array.

A : imsave()

B : imread()

C : read()

D : None of these

Q.no 10. Numpy supports this function to find the trigonometric sine elementwise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()

D : numpy.rad2sin(x1)

Q.no 11. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 12. In a numpy array, array indices always start from --------

A : 1

B : -1

C : 0

D : 2
Q.no 13. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 14. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 15. ------------ rule mining is a technique to identify underlying relations
between different items.

A : Classification

B : Regression

C : Clustering

D : Association

Q.no 16. ------------ means the part of the population chosen for participation in the study

A : Population

B : Sample

C : Association

D : Correlation

Q.no 17. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 18. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 19. Which of the following is not a type of clustering algorithm?

A : Density clustering

B : K-Means clustering

C : Centroid clustering

D : Simple clustering

Q.no 20. ---------- plot displays information as series of data points connected by
straight lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 21. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 22. ------------ is an example of semi structured data

A : NoSQL data

B : YouTube data
C : Text File data

D : Satellite imagery data

Q.no 23. Which of the following is used as attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 24. A -----------------graph is a circular plot, divided into slices to show numerical
proportions.

A : Bar

B : Scatter

C : pie

D : line

Q.no 25. --------------- searches for the linear optimal separating hyperplane for
separation of the data using essential training tuples called support vectors

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 26. ------------ is the step performed by a data scientist after acquiring the data.

A : Data Cleansing

B : Data Integration

C : Data Replication

D : Data loading

Q.no 27. Which function returns the identity array with n x n dimension with its
main diagonal set to ones and all other elements to zero.
A : numpy.ones()

B : numpy.zeros()

C : numpy.fill()

D : numpy.identity()

Q.no 28. --------- function from matplotlib.pyplot library plots bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 29. ------------------ is an excellent 2D and 3D graphics library for generating
scientific figures.

A : Pandas

B : Numpy

C : matplotlib

D : ndarray

Q.no 30. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 31. A ---------- is an example of the most widely used machine learning
algorithms; much of its popularity is because it can be adapted to almost any type
of data.

A : Clustering

B : Regression

C : Decision trees
D : Apriori

Q.no 32. The slope of the regression line of Y on X is also called the

A : Correlation coefficient

B : Regression coefficient

C : Association coefficient

D : Probability
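For simple regression, the slope of the line of Y on X can be computed with the least-squares formulas. The sketch below uses invented experience and salary figures, and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope (regression coefficient) = covariance of x and y / variance of x.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

years_experience = [1, 2, 3, 4, 5]      # independent variable X
salary_thousands = [30, 35, 40, 45, 50] # dependent variable Y (the regressand)

slope, intercept = fit_line(years_experience, salary_thousands)
predicted = intercept + slope * 6       # estimate Y for a new X value
print(slope, intercept, predicted)
```

With these numbers each extra year adds 5 to the estimated salary, which is exactly what the slope expresses.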

Q.no 33. -------- is the measure of the likelihood that an event will occur in a
random experiment

A : Probability

B : Correlation

C : Regression

D : Sample

Q.no 34. What is the use of the following function? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to figure

D : Add text to figure

Q.no 35. ----------- analysis finds the reasons behind success or failure in the past

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 36. Pandas provides the ----------- function as the entry point for all standard
database join operations while merging two DataFrame objects.

A : concat()

B : replace()
C : merge()

D : add()

Q.no 37. JSON file data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 38. Broadcasting is a powerful technique that allows numpy to work with
arrays of ------------- .

A : Same Shapes

B : Different Shapes

C : Same values

D : Different values
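Broadcasting can be seen in a couple of lines. The sketch below adds a one-dimensional row and a single-column array to a 2 x 3 matrix; the array contents are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])          # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)
column = np.array([[100], [200]])       # shape (2, 1)

# The row is conceptually repeated for every row of the matrix.
matrix_plus_row = matrix + row

# The column is conceptually repeated for every column of the matrix.
matrix_plus_column = matrix + column

print(matrix_plus_row)
print(matrix_plus_column)
```

NumPy does not actually copy the smaller array; it only behaves as if the smaller array had been stretched to the larger shape.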

Q.no 39. If scatter diagram is drawn and all scatter points lie on a straight line
then it indicates-------

A : No correlation

B : Perfect correlation

C : Regression

D : Skewness

Q.no 40. -------------- models search the data space for areas of varied density of data
points in the data space.

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 41. ------------ algorithm models a series of logical If-Then- Else decision
statements, there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.
A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 42. In matplotlib, -------------- is the container class for a Figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 43. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 44. When plotting with matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

Q.no 45. Catalog design is a complex process where the selection of items in a
business's catalog is often designed so that the items complement each other and buying
one item will lead to buying another. So these items are often complements or
very related. Which algorith

A : Decision tree

B : Association Rule Mining

C : Clustering
D : Support vector machine

Q.no 46. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

Q.no 47. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 48. --------- function performs the custom operations for the entire dataframe.

A : function()

B : surutine()

C : rutine()

D : pipe()

Q.no 49. For testing the accuracy of a machine learning algorithm, the whole dataset
should be divided into training and testing datasets. Which of the following is a
good proportion for train-test splitting?

A : Train- 70%, Test - 30%

B : Train- 50%, Test - 50%

C : Train- 30%, Test - 70%

D : Train- 100%, Test - 00%
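A 70/30 split can be sketched in plain Python as below. The helper name, the fixed seed and the toy data are assumptions for illustration; in practice a library routine such as scikit-learn's `train_test_split` is normally used.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out test_fraction of them for testing."""
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))
train, test = split_train_test(data)
print(len(train), len(test))
```

Keeping the larger share for training gives the model enough examples to learn from, while the held-out part measures accuracy on data it has never seen.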

Q.no 50. Which numpy function returns the truncated value of the
input elementwise?

A : round()
B : trunc()

C : del()

D : remove_decimal()

Q.no 51. When an increase or decrease in one variable has no impact on the
other variable, there is ------------

A : Perfect correlation

B : No Correlation

C : Positive Correlation

D : Negative Correlation

Q.no 52. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 53. --------- is a technique that duplicates the smaller array so that its dimensionality
and size match those of the larger array.

A : Multiplication

B : Broadcasting

C : Addition

D : Flatten

Q.no 54. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree
Q.no 55. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4

Q.no 56. Which of the following tasks is not performed by a Data Scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff Recruitment

Q.no 57. Which of the following functions is not used to iterate over the rows of the
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()
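The iteration helpers listed above can be tried on a tiny dataframe. This sketch assumes a recent pandas release, where column-wise iteration is spelled `items()` (the older `iteritems()` name has been removed); the data are made up.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"], "marks": [81, 67]})

# iterrows() yields (index, row-as-Series) pairs.
rows_seen = [(index, row["name"], row["marks"]) for index, row in df.iterrows()]

# itertuples() yields one namedtuple per row, with the index in the Index field.
tuples_seen = [(t.Index, t.name, t.marks) for t in df.itertuples()]

# items() walks the dataframe column by column as (label, Series) pairs.
columns_seen = [label for label, column in df.items()]

print(rows_seen)
print(tuples_seen)
print(columns_seen)
```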

Q.no 58. Which function returns an ndarray object that contains the numbers that
are evenly spaced on a log scale.

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()
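A brief sketch of log-spaced values follows. The endpoints and the number of points are chosen only for illustration.

```python
import numpy as np

# Four points from 10**0 to 10**3, evenly spaced on a log scale (base 10 by default).
points = np.logspace(0, 3, num=4)

# The exponents themselves are evenly spaced on a linear scale.
exponents = np.linspace(0, 3, num=4)

print(points)
print(exponents)
```

Taking `numpy.log10` of the log-spaced points recovers the linearly spaced exponents, which is what "evenly spaced on a log scale" means.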

Q.no 59. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()
C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 60. In unsupervised learning, scikit learn uses ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()
Answer for Question No 1. is a

Answer for Question No 2. is b

Answer for Question No 3. is d

Answer for Question No 4. is a

Answer for Question No 5. is b

Answer for Question No 6. is c

Answer for Question No 7. is d

Answer for Question No 8. is a

Answer for Question No 9. is b

Answer for Question No 10. is a

Answer for Question No 11. is c

Answer for Question No 12. is c

Answer for Question No 13. is a

Answer for Question No 14. is a

Answer for Question No 15. is d

Answer for Question No 16. is b

Answer for Question No 17. is b

Answer for Question No 18. is a

Answer for Question No 19. is d

Answer for Question No 20. is b

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is a

Answer for Question No 24. is c

Answer for Question No 25. is d

Answer for Question No 26. is a

Answer for Question No 27. is d

Answer for Question No 28. is c

Answer for Question No 29. is c

Answer for Question No 30. is b

Answer for Question No 31. is c

Answer for Question No 32. is b

Answer for Question No 33. is a

Answer for Question No 34. is a

Answer for Question No 35. is a

Answer for Question No 36. is c

Answer for Question No 37. is c

Answer for Question No 38. is b

Answer for Question No 39. is b

Answer for Question No 40. is d

Answer for Question No 41. is b

Answer for Question No 42. is d

Answer for Question No 43. is a

Answer for Question No 44. is a

Answer for Question No 45. is b

Answer for Question No 46. is a

Answer for Question No 47. is c

Answer for Question No 48. is d

Answer for Question No 49. is a

Answer for Question No 50. is b

Answer for Question No 51. is b

Answer for Question No 52. is a

Answer for Question No 53. is b

Answer for Question No 54. is b

Answer for Question No 55. is a

Answer for Question No 56. is d

Answer for Question No 57. is d

Answer for Question No 58. is a

Answer for Question No 59. is c

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 2. The procedure to organize items of a given collection into groups based on
some similar features is called -------------

A : Regression

B : Clustering

C : Decision Trees

D : Association
Q.no 3. ------------- is the fundamental library used for scientific computing

A : Pandas

B : Numpy

C : Sympy

D : Scipy

Q.no 4. -------- function is used to add a title to each axis instance in a figure.

A : set_title()

B : get_title()

C : set_label()

D : title()

Q.no 5. ---------- provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python

A : Pandas

B : Numpy

C : Scikit-Learn

D : image

Q.no 6. The -------- function creates a 2-D array with diagonal values 1 and rest
values zeros.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 7. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing
D : Data Structures

Q.no 8. To import data from a CSV file into a dataframe, the ---------- function is provided
by the pandas package.

A : read_csv()

B : read_file()

C : csv_read()

D : from_csv()

Q.no 9. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 10. Naïve Bayes is a classiﬁcation technique based on ----------

A : Bayes Theorem

B : Pythagorean Theorem

C : Least square method

D : mean square method

Q.no 11. ------------ means the part of the population chosen for participation in the study

A : Population

B : Sample

C : Association

D : Correlation

Q.no 12. If the number of input features is 3, then the optimal hyperplane in a support
vector machine is a -------------

A : Single point

B : Line
C : 2-D Plane

D : Non linear line

Q.no 13. The ---------------- method of a dataframe returns the first n rows of the dataframe

A : head(n)

B : tail(n)

C : first(n)

D : start(n)

Q.no 14. ------------ uses a tree structure to specify sequences of decisions and
consequences.

A : Regression

B : Decision trees

C : KNN

D : SVM

Q.no 15. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 16. -------- library is built on top of Numpy, SciPy and Matplotlib

A : Sympy

B : Scikit

C : Pandas

D : Numpy

Q.no 17. Which library from python is used for implementing machine learning
algorithms?

A : Scikit-Learn
B : Pandas

C : Matplotlib

D : Numpy

Q.no 18. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 19. Satellite image is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 20. Which of the following is not a raster image ﬁle format?

A : PNG

B : JPG

C : BMP

D : PDF

Q.no 21. Which of the following plots is not used for multidimensional
visualization?

A : Andrews Curves

B : Parallel Chart

C : Deviation Chart

D : Bar

Q.no 22. -------- is the measure of the likelihood that an event will occur in a
random experiment
A : Probability

B : Correlation

C : Regression

D : Sample

Q.no 23. The ----- algorithm is the simplest machine learning algorithm:
building the model consists only of storing the training dataset. To make a
prediction for a new data point, the algorithm finds the closest data points in the
training dataset i.e its

A : Apriori

B : K-Nearest Neighbors

C : K-Means

D : Decision Trees
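
The "store the training set, then look up the closest points" idea described in the question above can be sketched in plain Python as follows. The function name knn_predict, the value k=3 and the toy points are assumptions made for illustration, and Euclidean distance is one common choice of distance measure.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k closest training points."""
    # "Training" is just keeping the data; all the work happens at prediction time.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical two-class toy data.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (7, 7)))
```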

Q.no 24. If X and Y are both independent of each other, then correlation
coeﬃcient is ---------

A:1

B : -1

C:0

D:2
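
The range of the correlation coefficient can be checked numerically. The sketch below computes Pearson's r in plain Python; the helper name pearson_r and the sample values are assumptions made for this example. The third data set has no linear relationship between the variables, so its coefficient is 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2 * v + 1 for v in x])            # perfect positive linear relation
r_neg = pearson_r(x, [-3 * v + 7 for v in x])           # perfect negative linear relation
r_zero = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # symmetric pattern, no linear relation

print(r_pos, r_neg, r_zero)
```

Independent variables have a correlation coefficient of 0, but the converse does not hold in general: the third data set is fully determined by y = x squared and is still uncorrelated.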

Q.no 25. To rotate an image -------- function is used from scipy library.

A : rotation()

B : scipy.move()

C : scipy.ndimage.rotate()

D : scipy.ﬂip()

Q.no 26. To set the x-axis label of a figure, the ----------- function is used

A : set_title()

B : set_label()

C : set_xlabel()
D : get_xlabel()

Q.no 27. In the head()/tail() functions of a DataFrame, the default number of elements to
display is --------

A:3

B:5

C:1

D : 10

Q.no 28. Regression analysis -----------

A : Establishes a relationship between two variables

B : Establishes cause and effect

C : Measures growth

D : Measures demand for good

Q.no 29. ------------ is an indication of how frequently the itemset appears in the
dataset in association rule mining.

A : Conﬁdence

B : Support

C : Lift

D : None of These
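
The three association rule measures named in the options can be computed directly from a list of transactions. This is a minimal sketch in plain Python; the function names and the four sample baskets are hypothetical and chosen only for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """How often the rule antecedent -> consequent holds when the antecedent occurs."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence of the rule relative to the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support({"bread", "milk"}, baskets))
print(confidence({"bread"}, {"milk"}, baskets))
print(lift({"bread"}, {"milk"}, baskets))
```

Support measures how frequently an itemset appears, while confidence measures how often a rule has been found to be true.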

Q.no 30. In decision trees leaf node denotes a -----------------

A : class distribution

B : test on an attribute

C : outcome of the test

D : class labels

Q.no 31. ----------- analysis ﬁnds the reasons behind success or failure in past

A : Descriptive

B : Prescriptive
C : Predictive

D : Probability

Q.no 32. In this type of algorithms inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 33. Pandas provides the ----------- function as the entry point for all standard
database join operations while merging two DataFrame objects.

A : concat()

B : replace()

C : merge()

D : add()

Q.no 34. ------------ is a 2-D data structure defined in pandas in which data is arranged in
rows and columns.

A : Series

B : Dataframe

C : ndarray

D : list

Q.no 35. ------------ is an example of semi structured data

A : NoSQL data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 36. ------------ is the step performed by a data scientist after acquiring the data.

A : Data Cleansing
B : Data Integration

C : Data Replication

D : Data loading

Q.no 37. ------------ is a measure of the randomness in the information being
processed.

A : Entropy

B : Support

C : Conﬁdence

D : lift
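
Entropy as a measure of randomness can be made concrete with Shannon's formula, the sum of p * log2(1/p) over the class proportions p. The sketch below is plain Python; the function name entropy and the sample label lists are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )

pure = entropy(["yes", "yes", "yes", "yes"])   # a single class: no uncertainty
mixed = entropy(["yes", "no", "yes", "no"])    # two equally likely classes
print(pure, mixed)
```

A pure node has entropy 0 and an evenly split two-class node has entropy 1 bit, which is why decision tree algorithms prefer splits that reduce entropy.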

Q.no 38. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 39. ------- is the basic data structure of pandas and can be thought of as an SQL table or a
spreadsheet data representation.

A : Dataframe

B : series

C : list

D : ndarray

Q.no 40. ------------- regression finds a relationship between one or more features
(independent variables) and a continuous variable (dependent variable).

A : Non-linear

B : Linear

C : Both of these

D : None of These
Q.no 41. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 42. The ---------- machine learning algorithm is used in cross marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 43. In a DataFrame, to compute summary statistics like count, mean, standard
deviation, min and max for each numerical column, the ---------- function is
used.

A : display()

B : head()

C : describe()

D : sort()

Q.no 44. Catalog design is a complex process in which the items in a
business's catalog are often selected to complement each other, so that buying
one item will lead to buying another. Such items are often complements or
closely related. Which algorith

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 45. To test the accuracy of a machine learning algorithm, the whole data set
should be divided into training and testing datasets. Which of the following is a
good proportion for the train-test split?
A : Train- 70%, Test - 30%

B : Train- 50%, Test - 50%

C : Train- 30%, Test - 70%

D : Train- 100%, Test - 00%
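
A 70/30 split like the one in option A can be sketched in plain Python as shown below. The function name split_train_test, the default seed and the ten-row sample are assumptions made for this example. scikit-learn provides sklearn.model_selection.train_test_split for the same purpose.

```python
import random

def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle `rows` reproducibly and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)          # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)  # number of rows held out for testing
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = split_train_test(range(10))
print(len(train_rows), len(test_rows))
```

Shuffling before splitting avoids a biased test set when the original rows are ordered.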

Q.no 46. --------------- is basically extracting particular set of elements from an array.

A : Slicing

B : indexing

C : sorting

D : broadcasting

Q.no 47. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Conﬁdence

D : lift

Q.no 48. The ------------ algorithm models a series of logical If-Then-Else decision
statements; there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 49. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left
Q.no 50. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 51. Which of the following tasks is not performed by a Data Scientist?

A : Deﬁne the question

B : Create reproducible code

C : Challenge results

D : Staff Recruitment

Q.no 52. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 53. Plot_number parameter from subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows

D : ncols

Q.no 54. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution
D : Probability

Q.no 55. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4
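
Matplotlib numbers subplots from 1, starting at the top-left cell and moving row by row. The helper below is a plain-Python sketch of that numbering rule; the name subplot_position is hypothetical and is not a matplotlib function.

```python
def subplot_position(nrows, ncols, index):
    """Return the 0-based (row, col) of a 1-based, row-major subplot index."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows*ncols")
    return divmod(index - 1, ncols)

print(subplot_position(4, 3, 5))  # (1, 1): second row, second column of a 4 x 3 grid
print(subplot_position(2, 3, 3))  # (0, 2): top-right cell of a 2 x 3 grid
```

This also shows why the plot number can range only from 1 to nrows*ncols.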

Q.no 56. The strength (degree) of the correlation between a set of independent
variables X and a dependent variable Y is measured by-------------

A : Coeﬃcient of Correlation

B : Coeﬃcient of Determination

C : Standard error of estimate

D : Probability

Q.no 57. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 58. In matplotlib -------------- is container class for ﬁgure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 59. Which of the following machine learning algorithms is used for market
basket analysis, i.e. to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree
B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 60. To determine the basic salary of an employee when his qualification is given is
a ----------- problem

A : Correlation

B : Regression

C : Association

D : Qualitative
Answer for Question No 1. is b

Answer for Question No 2. is b

Answer for Question No 3. is d

Answer for Question No 4. is a

Answer for Question No 5. is c

Answer for Question No 6. is c

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is a

Answer for Question No 10. is a

Answer for Question No 11. is b

Answer for Question No 12. is c

Answer for Question No 13. is a

Answer for Question No 14. is b

Answer for Question No 15. is a

Answer for Question No 16. is b

Answer for Question No 17. is a

Answer for Question No 18. is d

Answer for Question No 19. is b

Answer for Question No 20. is d

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is c

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is b

Answer for Question No 28. is a

Answer for Question No 29. is b

Answer for Question No 30. is c

Answer for Question No 31. is a

Answer for Question No 32. is a

Answer for Question No 33. is c

Answer for Question No 34. is b

Answer for Question No 35. is a

Answer for Question No 36. is a

Answer for Question No 37. is a

Answer for Question No 38. is b

Answer for Question No 39. is a

Answer for Question No 40. is b

Answer for Question No 41. is d

Answer for Question No 42. is b

Answer for Question No 43. is c

Answer for Question No 44. is b

Answer for Question No 45. is a

Answer for Question No 46. is a

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is d

Answer for Question No 52. is b

Answer for Question No 53. is a

Answer for Question No 54. is a

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is c

Answer for Question No 58. is d

Answer for Question No 59. is b

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. NumPy supports this function to find the trigonometric sine elementwise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()

D : numpy.rad2sin(x1)

Q.no 2. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered
Q.no 3. ---------- function is used to get the positive square root of a NumPy array
elementwise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.ﬁnd(x1,2)

Q.no 4. -------------- data does not fit into a data model due to variations in contents.

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 5. Which of the following is NOT supervised learning?

A : PCA

B : Decision Tree

C : Linear Regression

D : Naive Bayesian

Q.no 6. ----------------- analysis estimates the relationship between single dependent
variable and single independent variable

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 7. Which of the following functions is used to create an array of a specified
shape filled with random values?

A : numpy.random.rand()

B : rank

C : random.ﬁll()
D : numpy.ﬁllrandom()

Q.no 8. ----------------- is an example of human generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 9. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 10. The -------- function creates a 2-D array with all values 0 (zeros).

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 11. ------------- is fundamental library used for scientiﬁc computing

A : Pandas

B : Numpy

C : Sympy

D : Scipy

Q.no 12. The -------- function creates a 2-D array with diagonal values 1 and rest
values zeros.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()
D : numpy.empty()

Q.no 13. Pandas provides the ----------- method for label-based indexing.

A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 14. The ---------- attribute speciﬁes the number of dimensions or axes of the
array.

A : ndarray.size

B : ndarray.dtype

C : ndarray.ndim

D : ndarray.axes

Q.no 15. In support vector machines, if there are 2 input features, then the decision
boundary or hyperplane is a ---------------.

A : 2-D plane

B : 3-D plane

C : Line

D : point

Q.no 16. The ------------- type of analytics describes what happened in the past

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 17. ---- is a technique to learn from examples and experience, without being
explicitly programmed.

A : Machine Learning

B : Software Testing
C : Computer Science

D : Data mining

Q.no 18. ------------ means part of population chosen for participation in the study

A : Population

B : Sample

C : Association

D : Correlation

Q.no 19. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to ﬁnd frequent item set.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 20. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 21. ------------------ is a flow-chart like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and
leaf nodes represent classes or class distributions.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 22. What is the correct syntax to generate integers from 10 to 30 (both inclusive)?

A : x=numpy.arange(10,30)

B : x=numpy.array(10,30)

C : x=numpy.arange(10,31)

D : x=arange(10,31)

Q.no 23. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Conﬁdence

B : Support

C : Lift

D : None of These

Q.no 24. ------------- function is used to save an array as an image file.

A : matplotlib.pyplot.image()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 25. If X and Y are both independent of each other, then correlation
coeﬃcient is ---------

A:1

B : -1

C:0

D:2

Q.no 26. JSON ﬁle data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered
Q.no 27. What is the use of the following function? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to ﬁgure

D : Add text to ﬁgure

Q.no 28. Regression analysis -----------

A : Establishes a relationship between two variables

B : Establishes cause and effect

C : Measures growth

D : Measures demand for good

Q.no 29. In this type of algorithms inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 30. ----------- analysis ﬁnds the reasons behind success or failure in past

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 31. -------------- models search the data space for areas of varied density of data
points in the data space.

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models
Q.no 32. ---------- function is used to get the elementwise remainder of division of two arrays

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 33. If a=np.array([1,2,3,4,5,6,7,8,9,10]) then a[2:5:1] will produce output ------------------

A : 3, 4, 5

B : 3,4,5,6

C : 2,3,4,5

D : 1,2,3,4,5
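
The slicing behavior asked about above can be checked directly. This is a minimal sketch, assuming NumPy is installed and that the intended expression is the slice a[2:5:1] (start index 2, stop index 5 exclusive, step 1); a comma-separated index such as a[2,5,1] is not valid for a 1-D array.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

picked = a[2:5:1]   # indices 2, 3 and 4; the stop index 5 is excluded
print(picked)

try:
    a[2, 5, 1]      # three indices on a 1-D array
except IndexError as err:
    print("IndexError:", err)
```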

Q.no 34. The slope of the regression line of Y on X is also called the

A : Correlation coeﬃcient

B : Regression coeﬃcient

C : Association coeﬃcient

D : Probability

Q.no 35. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 36. In the head()/tail() functions of a DataFrame, the default number of elements to
display is --------

A:3

B:5

C:1
D : 10

Q.no 37. A perfect negative correlation is signiﬁed by -------------

A:1

B : -1

C:0

D:2

Q.no 38. ---------- is an unsupervised technique aiming to divide a multivariate dataset
into clusters or groups.

A : KNN

B : Support Vector Machines

C : Regression

D : Cluster analysis

Q.no 39. In which of the following types of clustering algorithm is the
notion of similarity derived from the closeness of a data point to the
centroid of the clusters?

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 40. ------------ is an example of semi structured data

A : XML data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 41. Plot_number parameter from subplot() function can range from 1 to ------

A : nrows*ncols

B : max
C : nrows

D : ncols

Q.no 42. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 43. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 44. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 45. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 46. To determine the basic salary of an employee when his qualification is given is
a ----------- problem
A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 47. In Data science project data acquisition step involves----------------

A : Acquiring data from various sources.

B : Selecting dataset

C : Data preprocessing

D : Data modeling

Q.no 48. --------- is a technique that duplicates the smaller array so that its dimensionality
and size match the dimensionality and size of the larger array.

A : Multiplication

B : Broadcasting

C : Addition

D : Flatten
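
The stretching of a smaller array to match a larger one can be seen in a short example. This is a minimal sketch, assuming NumPy is installed; the array values are arbitrary.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])    # shape (2, 3)
row = np.array([10, 20, 30])      # shape (3,)

# `row` is treated as if it were repeated for each row of `matrix`;
# NumPy does this without actually copying the data.
result = matrix + row
print(result)
```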

Q.no 49. Which function from NumPy is used to return the truncated value of the
input elementwise?

A : round()

B : trunc()

C : del()

D : remove_decimal()

Q.no 50. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()
Q.no 51. Which of the following machine learning algorithms is used for market
basket analysis, i.e. to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 52. The ---------- machine learning algorithm is used in cross marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 53. In regression the independent variable is also called as ------------

A : Regressor

B : Continuous

C : Regressand

D : Estimated

Q.no 54. The ------------ algorithm models a series of logical If-Then-Else decision
statements; there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 55. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support
C : Conﬁdence

D : lift

Q.no 56. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 57. The --------- argument of the merge function, used while merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how
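
The effect of the join-type argument of merge can be compared on two small tables. This is a minimal sketch, assuming pandas is installed; the DataFrames, column names and values are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [85, 90, 70]})

inner = pd.merge(left, right, on="id", how="inner")     # keys present in both tables
outer = pd.merge(left, right, on="id", how="outer")     # keys present in either table
left_join = pd.merge(left, right, on="id", how="left")  # all keys from the left table

print(inner)
print(outer)
print(left_join)
```

Rows without a match on the other side are filled with NaN in the outer and left joins.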

Q.no 58. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 59. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 60. Which of the following functions is not used to iterate over the rows of the
DataFrame?
A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()
Answer for Question No 1. is a

Answer for Question No 2. is a

Answer for Question No 3. is a

Answer for Question No 4. is b

Answer for Question No 5. is a

Answer for Question No 6. is a

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is a

Answer for Question No 10. is b

Answer for Question No 11. is d

Answer for Question No 12. is c

Answer for Question No 13. is b

Answer for Question No 14. is c

Answer for Question No 15. is c

Answer for Question No 16. is a

Answer for Question No 17. is a

Answer for Question No 18. is b

Answer for Question No 19. is d

Answer for Question No 20. is d

Answer for Question No 21. is a

Answer for Question No 22. is c

Answer for Question No 23. is a

Answer for Question No 24. is d

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is a

Answer for Question No 28. is a

Answer for Question No 29. is a

Answer for Question No 30. is a

Answer for Question No 31. is d

Answer for Question No 32. is b

Answer for Question No 33. is a

Answer for Question No 34. is b

Answer for Question No 35. is b

Answer for Question No 36. is b

Answer for Question No 37. is b

Answer for Question No 38. is d

Answer for Question No 39. is b

Answer for Question No 40. is a

Answer for Question No 41. is a

Answer for Question No 42. is a

Answer for Question No 43. is d

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is b

Answer for Question No 50. is c

Answer for Question No 51. is b

Answer for Question No 52. is b

Answer for Question No 53. is a

Answer for Question No 54. is b

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is d

Answer for Question No 58. is c

Answer for Question No 59. is b

Answer for Question No 60. is d

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. Unsupervised learning makes sense of ------------- data without having any
predeﬁned dataset for its training.

A : unlabeled

B : labeled

C : semi-labeled

D : Empty dataset

Q.no 2. For multidimensional visualization ---------------- are used.

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots
Q.no 3. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 4. ---------------- function multiplies two matrices in NumPy.

A : prod()

B : mult()

C : dot()

D : *

Q.no 5. If the number of input features is 3, then the optimal hyperplane in a support
vector machine is a -------------

A : Single point

B : Line

C : 2-D Plane

D : Non linear line

Q.no 6. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 7. ------ answers questions like "How can we make it happen?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability
Q.no 8. Pandas provides the ----------- method for label-based indexing.

A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 9. ----------------- analysis estimates the relationship between single dependent
variable and single independent variable

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 10. -------------------- is a general-purpose array-processing package that provides a
high-performance multi-dimensional array object and tools for working with
these arrays.

A : NumPy

B : SciPy

C : sklearn

D : None of these

Q.no 11. The leaf nodes in decision trees return the ---------

A : decision condition

B : class labels

C : decision on variables

D : test score

Q.no 12. Satellite image is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 13. The -------- function creates a 2-D array with all values 0 (zeros).

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 14. ---------- function is used to get the positive square root of a NumPy array
elementwise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.ﬁnd(x1,2)

Q.no 15. Pin code of a city is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 16. ------------- is fundamental library used for scientiﬁc computing

A : Pandas

B : Numpy

C : Sympy

D : Scipy

Q.no 17. Find odd one out from the following :

A : KNN

B : Naïve Bayes

C : Decision Trees
D : Cluster analysis

Q.no 18. ----------- is a supervised machine learning algorithm that outputs an optimal
hyperplane for given labeled training data

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 19. To import data from a CSV file into a DataFrame, the ---------- function is provided
by the pandas package.

A : read_csv()

B : read_ﬁle()

C : csv_read()

D : from_csv()

Q.no 20. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 21. JSON ﬁle data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 22. -------- is most important language for Data Science.

A : Java

B : Ruby
C : R

D : None of these

Q.no 23. ------------ is a 2-D data structure defined in pandas in which data is arranged in
rows and columns.

A : Series

B : Dataframe

C : ndarray

D : list

Q.no 24. ------------------is a ﬂow-chart like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and
leaf nodes represent classes or class distributions.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 25. Which of the following is not used for 2-D Visualisation?

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 26. The -------- of a numpy array is a tuple of integers giving the size of the
array along each dimension.

A : axes

B : rank

C : shape

D : size

Q.no 27. Pandas provides the ----------- method for purely integer-based
indexing.
A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 28. --------- in decision tree measures how much information a feature gives us
about the class

A : Information Gain

B : Posterior probability

C : Prior probability

D : probability

Q.no 29. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 30. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 31. ------------ is a supervised machine learning algorithm which relies on the
assumption of feature independence to classify input data.

A : Clustering

B : Regression

C : Naïve Bayes
D : Apriori

Q.no 32. --------------------- is a form of supervised learning which is used by
mail service providers like Gmail, Yahoo, etc. to classify a new mail as spam or
not spam.

A : Classiﬁcation

B : Regression

C : Clustering

D : Naïve Bayes

Q.no 33. The objective of the --------- algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 34. --------- function from matplotlib.pyplot library plots bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 35. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : software tester

Q.no 36. In matplotlib, the ------------- function groups smaller axes that can exist
together within a single figure.

A : subplot()
B : divide_ﬁgure()

C : add_ﬁg()

D : group_ﬁg()

Q.no 37. ------------- function is used to save an array as an image file.

A : matplotlib.pyplot.image()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 38. ------------ is a measure of the randomness in the information being
processed.

A : Entropy

B : Support

C : Conﬁdence

D : lift

Q.no 39. ---------- function is used to add two NumPy arrays elementwise.

A : numpy.add(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.addition(x1,x2)

Q.no 40. In this type of clustering, each data point either belongs to a cluster
completely or not.

A : Hard clustering

B : Soft Clustering

C : Medium clustering

D : Simple clustering

Q.no 41. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------
A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4

Q.no 42. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 43. Which function from NumPy is used to return the truncated value of the
input elementwise?

A : round()

B : trunc()

C : del()

D : remove_decimal()

Q.no 44. Which function returns an ndarray object that contains the numbers that
are evenly spaced on a log scale.

A : numpy.logspace()

B : numpy.log()

C : numpy.ﬁll()

D : numpy.random()

Q.no 45. Which of the following statement will create an axes at the top right
corner of the current ﬁgure

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)
Q.no 46. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 47. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 48. The --------- argument of merge function while merging two dataframes
speciﬁes which keys are to be included in the resulting dataframe.

A : right

B : on

C : sort

D : how

Q.no 49. The --------- function performs custom operations on the entire DataFrame.

A : function()

B : subroutine()

C : routine()

D : pipe()

Q.no 50. --------------- is basically extracting a particular set of elements from an array.

A : Slicing

B : Indexing

C : Sorting

D : Broadcasting

Q.no 51. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top to bottom

B : Bottom to top

C : Left to right

D : Right to left

Q.no 52. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 53. Which of the following machine learning algorithms is used for market
basket analysis, that is, to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 54. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 55. In matplotlib, -------------- is the container class for a Figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 56. Which of the following algorithms is used in economics, finance, biology,
etc., to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 57. In regression, the dependent variable is also called the ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 58. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 59. The ------------ algorithm models a series of logical If-Then-Else decision
statements; there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 60. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Answer for Question No 1. is a

Answer for Question No 2. is c

Answer for Question No 3. is a

Answer for Question No 4. is c

Answer for Question No 5. is c

Answer for Question No 6. is a

Answer for Question No 7. is b

Answer for Question No 8. is b

Answer for Question No 9. is a

Answer for Question No 10. is a

Answer for Question No 11. is b

Answer for Question No 12. is b

Answer for Question No 13. is b

Answer for Question No 14. is a

Answer for Question No 15. is a

Answer for Question No 16. is d

Answer for Question No 17. is d

Answer for Question No 18. is b

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is c

Answer for Question No 22. is c

Answer for Question No 23. is b

Answer for Question No 24. is a

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is a

Answer for Question No 28. is a

Answer for Question No 29. is b

Answer for Question No 30. is d

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is b

Answer for Question No 34. is c

Answer for Question No 35. is d

Answer for Question No 36. is a

Answer for Question No 37. is d

Answer for Question No 38. is a

Answer for Question No 39. is a

Answer for Question No 40. is a

Answer for Question No 41. is a

Answer for Question No 42. is a

Answer for Question No 43. is b

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is c

Answer for Question No 47. is b

Answer for Question No 48. is d

Answer for Question No 49. is d

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is d

Answer for Question No 53. is b

Answer for Question No 54. is d

Answer for Question No 55. is d

Answer for Question No 56. is a

Answer for Question No 57. is c

Answer for Question No 58. is a

Answer for Question No 59. is b

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. Naïve Bayes is a classification technique based on ----------

A : Bayes Theorem

B : Pythagoras Theorem

C : Least square method

D : Mean square method

Q.no 2. The ------------- function is used to plot a histogram using the matplotlib library.

A : hist()

B : bar()

C : pie()

D : scatter()

Q.no 3. ------------ rule mining is a technique to identify underlying relations
between different items.

A : Classification

B : Regression

C : Clustering

D : Association

Q.no 4. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 5. To import data from an Excel file into a DataFrame, the ---------- function is
provided by the pandas package.

A : read_csv()

B : read_file()

C : read()

D : read_excel()

Q.no 6. In a NumPy array, array indices always start from --------

A : 1

B : -1

C : 0

D : 2

Q.no 7. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 8. The ---------- function is used to get the positive square root of a NumPy array
elementwise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.find(x1,2)

Q.no 9. In --------- learning, the training is controlled by an external supervisor or
teacher.

A : Unsupervised

B : Supervised

C : Semi-supervised

D : Group

Q.no 10. For multidimensional visualization ---------------- are used.

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 11. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 12. To import data from a CSV file into a DataFrame, the ---------- function is provided
by the pandas package.

A : read_csv()

B : read_file()

C : csv_read()

D : from_csv()

Q.no 13. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 14. The K-nearest neighbors algorithm is based on -------------- learning

A : Unsupervised

B : Supervised

C : Association

D : Correlation

Q.no 15. In support vector machines, if there are 2 input features, then the decision
boundary or hyperplane is a ---------------.

A : 2-D plane

B : 3-D plane

C : Line

D : Point

Q.no 16. ---------------- submodule of scipy is dedicated to image processing.

A : ndarray

B : spatial

C : ndimage

D : special

Q.no 17. ------------ uses a tree structure to specify sequences of decisions and
consequences.

A : Regression

B : Decision trees

C : KNN

D : SVM

Q.no 18. NumPy supports this function to find the trigonometric sine elementwise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()

D : numpy.rad2sin(x1)

Q.no 19. The procedure to organize items of a given collection into groups based
on some similar features is called -------------

A : Regression

B : Clustering

C : Decision Trees

D : Association

Q.no 20. matplotlib.pyplot.imread() function is used to ---------------

A : save image

B : read image

C : copy image

D : show image

Q.no 21. -------------- models search the data space for areas of varied density of data
points in the data space.

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 22. Pandas provides the ----------- method to get purely integer-based
indexing.

A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 23. To rotate an image -------- function is used from scipy library.

A : rotation()

B : scipy.move()

C : scipy.ndimage.rotate()

D : scipy.flip()

Q.no 24. ------------- is an unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 25. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : Software testing

Q.no 26. The number of iterations in apriori ---------------

A : increases with the size of the data

B : decreases with the increase in size of the data

C : increases with the size of the maximum frequent set

D : decreases with increase in size of the maximum frequent set
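
The link between the number of passes and the size of the largest frequent itemset can be seen in a small level-wise sketch. This is a simplified, assumed implementation written for illustration (support is given as an absolute count, and the transactions are made up); it is not an optimized Apriori.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return (frequent itemsets with their counts, number of passes over the data)."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    passes = 0
    k = 1
    while candidates:
        passes += 1  # one scan of the data per candidate size
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent itemsets into candidates of size k.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent, passes


baskets = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
frequent, passes = apriori(baskets, min_support=2)
largest = max(len(itemset) for itemset in frequent)
print(passes, largest)
```

In this example the largest frequent itemset has three items and the data is scanned three times; adding more transactions of the same kind would not add passes, but a larger maximum frequent itemset would.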

Q.no 27. ------------- regression finds a relationship between one or more features
(independent variables) and a continuous variable (dependent variable).

A : Non-linear

B : Linear

C : Both of these

D : None of These

Q.no 28. ------------------ is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and
leaf nodes represent classes or class distributions.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 29. Which of the following is not used for 2-D Visualisation?

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 30. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))

Q.no 31. In decision trees, a leaf node denotes a -----------------

A : class distribution

B : test on an attribute

C : outcome of the test

D : class labels

Q.no 32. Which of the following is used as attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support
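
To make the attribute selection measure concrete, the sketch below computes information gain as the drop in entropy after splitting on an attribute. The function names and the four-row toy dataset are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" not at all.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
]

gain_outlook = information_gain(rows, "outlook", "play")
gain_windy = information_gain(rows, "windy", "play")
print(gain_outlook, gain_windy)
```

A decision tree learner that uses this measure would split on the attribute with the larger gain, here "outlook".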

Q.no 33. A ------------ is a supervised machine learning algorithm which relies on the
assumption of feature independence to classify input data.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 34. The ---------- function is used to get the elementwise remainder of division of arrays.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 35. In this type of algorithms inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 36. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These
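
The three association rule measures named in the options can be computed directly from a list of transactions. The sketch below uses made-up baskets and hypothetical helper names; support follows the definition Support(B) = (transactions containing B) / (total transactions).

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """How often the rule antecedent -> consequent holds when the antecedent occurs."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence of the rule relative to the baseline support of the consequent."""
    return confidence(antecedent, consequent) / support(consequent)


print(support({"bread"}))
print(confidence({"bread"}, {"milk"}))
print(lift({"bread"}, {"milk"}))
```

A lift below 1, as in this toy data, suggests the two items appear together slightly less often than independence would predict.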

Q.no 37. --------- function from matplotlib.pyplot library plots bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 38. To set the x-axis label of a figure, the ----------- function is used.

A : set_title()

B : set_lable()

C : set_xlabel()

D : get_xlabel()

Q.no 39. What is the use of the following function? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to figure

D : Adds text to figure

Q.no 40. In SciPy ---------- submodule is dedicated to image processing.

A : ndimage

B : ndarray

C : signal

D : io

Q.no 41. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 42. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff recruitment

Q.no 43. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top to bottom

B : Bottom to top

C : Left to right

D : Right to left

Q.no 44. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 45. In regression, the independent variable is also called the ------------

A : Regressor

B : Continuous

C : Regressand

D : Estimated

Q.no 46. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 47. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 48. When an increase or decrease in one variable has no impact on the
other variable, it is called ------------

A : Perfect correlation

B : No Correlation

C : Positive Correlation

D : Negative Correlation
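
The four kinds of correlation listed above can be told apart by the sign and size of the correlation coefficient. The following sketch computes Pearson's r in plain Python; the function name and the sample data are assumptions for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


x = [1, 2, 3, 4, 5]
positive = pearson_r(x, [2, 4, 6, 8, 10])    # y rises with x
negative = pearson_r(x, [10, 8, 6, 4, 2])    # y falls as x rises
none = pearson_r(x, [2, 1, 0, 1, 2])         # no linear relationship
print(positive, negative, none)
```

Values near +1 or -1 indicate a perfect positive or negative linear relationship, and a value near 0 indicates no linear correlation.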

Q.no 49. To test the accuracy of a machine learning algorithm, the whole data set
should be divided into training and testing datasets. Which of the following is a
good proportion for train-test splitting?

A : Train- 70%, Test - 30%

B : Train- 50%, Test - 50%

C : Train- 30%, Test - 70%

D : Train- 100%, Test - 00%
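
A 70/30 split can be sketched with the standard library alone. The helper name `split_train_test`, the seed, and the toy data are assumptions for this example; libraries such as scikit-learn provide their own splitting utilities.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


data = list(range(100))
train, test = split_train_test(data)
print(len(train), len(test))
```

Shuffling before splitting avoids putting, for example, only the most recent records in the test set; every row ends up in exactly one of the two lists.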

Q.no 50. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 51. The plot_number parameter of the subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows

D : ncols

Q.no 52. The ------------ algorithm models a series of logical If-Then-Else decision
statements; there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 53. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 54. In this type of clustering, instead of putting each data point into a separate
cluster, a probability or likelihood of that data point being in each cluster is
assigned.

A : Hard clustering

B : Soft Clustering

C : Medium clustering

D : Simple clustering

Q.no 55. In regression, the dependent variable is also called the ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 56. The --------- argument of the merge function, used while merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how

Q.no 57. While plotting using matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)
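
The three-digit shorthand packs the number of rows, the number of columns, and the plot index into one integer. The hypothetical helper below only decodes such a code to show the correspondence; it does not call matplotlib.

```python
def parse_subplot_code(code):
    """Split a three-digit subplot code into (nrows, ncols, index)."""
    if not 111 <= code <= 999:
        raise ValueError("expected a three-digit integer")
    nrows, rest = divmod(code, 100)   # hundreds digit
    ncols, index = divmod(rest, 10)   # tens digit and units digit
    return nrows, ncols, index


print(parse_subplot_code(234))
```

The shorthand only works when each of the three values is a single digit, which is why larger grids need the comma-separated form.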

Q.no 58. Catalog design is a complex process in which the items in a
business's catalog are often selected to complement each other so that buying
one item will lead to buying another. So these items are often complements or
very related. Which algorith

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 59. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 60. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Answer for Question No 1. is a

Answer for Question No 2. is a

Answer for Question No 3. is d

Answer for Question No 4. is a

Answer for Question No 5. is d

Answer for Question No 6. is c

Answer for Question No 7. is b

Answer for Question No 8. is a

Answer for Question No 9. is b

Answer for Question No 10. is c

Answer for Question No 11. is d

Answer for Question No 12. is a

Answer for Question No 13. is a

Answer for Question No 14. is b

Answer for Question No 15. is c

Answer for Question No 16. is c

Answer for Question No 17. is b

Answer for Question No 18. is a

Answer for Question No 19. is b

Answer for Question No 20. is b

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is c

Answer for Question No 24. is d

Answer for Question No 25. is d

Answer for Question No 26. is c

Answer for Question No 27. is b

Answer for Question No 28. is a

Answer for Question No 29. is c

Answer for Question No 30. is a

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is c

Answer for Question No 34. is b

Answer for Question No 35. is a

Answer for Question No 36. is a

Answer for Question No 37. is c

Answer for Question No 38. is c

Answer for Question No 39. is a

Answer for Question No 40. is a

Answer for Question No 41. is b

Answer for Question No 42. is d

Answer for Question No 43. is a

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is b

Answer for Question No 53. is c

Answer for Question No 54. is b

Answer for Question No 55. is c

Answer for Question No 56. is d

Answer for Question No 57. is a

Answer for Question No 58. is b

Answer for Question No 59. is d

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. The correlation coefficient value lies between ----- and -----

A : -1 and +1

B : -1 and 0

C : 0 and 1

D : 0 and infinity

Q.no 2. The ------------- type of analytics describes what happened in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 3. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 4. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 5. The ------------- function reads an image from a file as an array.

A : imsave()

B : imread()

C : read()

D : None of these

Q.no 6. Find odd one out from the following :

A : KNN

B : Naïve Bayes

C : Decision Trees

D : Cluster analysis

Q.no 7. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 8. Pin code of a city is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 9. matplotlib.pyplot.imread() function is used to ---------------

A : save image

B : read image

C : copy image

D : show image

Q.no 10. Choose correct option for machine generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Sensor data

Q.no 11. Which function is used to give a title to the axes?

A : plt.title()

B : plt.xlabel()

C : plt.ylabel()

D : plt.xscale()

Q.no 12. Which of the following measures is used in decision trees when selecting the
splitting criterion that partitions the data in the best possible manner?

A : Information Gain

B : Probability

C : Regression

D : Association

Q.no 13. ------------ means the part of the population chosen for participation in the study.

A : Population

B : Sample

C : Association

D : Correlation

Q.no 14. ----------------- is an example of human generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 15. The ------------- function is used to save an ndarray as an image.

A : imsave()

B : imread()

C : save()

D : isave()

Q.no 16. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 17. ------- answers the question "What will happen in the future?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 18. The ---------------- method of a DataFrame reads the first n rows of the DataFrame.

A : head(n)

B : tail(n)

C : first(n)

D : start(n)

Q.no 19. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 20. -------------------- is a general-purpose array-processing package that provides a
high-performance multidimensional array object and tools for working with
these arrays.

A : NumPy

B : SciPy

C : sklearn

D : None of these

Q.no 21. -------- uses a tree structure to specify a sequence of decisions and
consequences.

A : KNN

B : Naïve Bayes

C : Regression

D : Decision Tree

Q.no 22. Which statement will create a 5 x 5 array filled with all values 1?

A : x=numpy.ones((5,5))

B : x=numpy.ones(5)

C : x=numpy.zeros((5,5))

D : x=numpy.eye((5,5))

Q.no 23. In the matplotlib library, the ------------- module supports basic image loading,
rescaling and display operations.

A : picture

B : image

C : pyplot

D : sympy

Q.no 24. The ---------- function is used to get the elementwise remainder of division of arrays.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 25. In a ------------, the x-axis is divided into bins and each bin is treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 26. -------- is the most important language for data science.

A : Java

B : Ruby

C : R

D : None of these

Q.no 27. The ----- algorithm is the simplest machine learning algorithm, for which
building the model consists only of storing the training dataset. To make a
prediction for a new data point, the algorithm finds the closest data points in the
training dataset i.e its

A : Apriori

B : K-Nearest Neighbors
C : K-Means

D : Decision Trees
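
The "store the training data, then look up the closest points" idea described in the question can be sketched in a few lines. The function name, the choice of k = 3, Euclidean distance, and the toy points are all assumptions made for this illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest neighbours."""
    # "Training" is just keeping the data; all the work happens at prediction time.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5)))
```

Because the labels of the stored points drive the prediction, this is a supervised method even though no explicit model is fitted.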

Q.no 28. From matplotlib, the ------------------ module is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 29. In which of the following types of clustering algorithms is the notion of
similarity derived from the closeness of a data point to the
centroid of the clusters?

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models
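
Closeness to a centroid can be made concrete with a short assignment step, the core of centroid-based methods such as k-means. The centroids, points, and helper name below are assumptions for illustration; a full algorithm would also recompute the centroids and repeat.

```python
import math


def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


centroids = [(0.0, 0.0), (10.0, 10.0)]
points = [(1, 2), (9, 8), (4, 4), (6, 7)]

assignments = [nearest_centroid(p, centroids) for p in points]
print(assignments)
```

Each point is assigned to exactly one cluster here, which is also an example of hard clustering.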

Q.no 30. --------------------- is a form of supervised learning algorithm that is used by
mail service providers such as Gmail, Yahoo, etc. to classify a new mail as spam or
not spam.

A : Classification

B : Regression

C : Clustering

D : Naïve Bayes

Q.no 31. The number of iterations in apriori ---------------

A : increases with the size of the data

B : decreases with the increase in size of the data

C : increases with the size of the maximum frequent set

D : decreases with increase in size of the maximum frequent set

Q.no 32. In this type of algorithm, inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 33. The objective of the --------- algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 34. Which of the following is used as attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 35. ----------- analysis finds the reasons behind success or failure in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 36. A ----------------- graph is a circular plot, divided into slices to show numerical
proportions.

A : Bar

B : Scatter

C : Pie

D : Line

Q.no 37. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))

Q.no 38. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : Software testing

Q.no 39. ------------ is an indication of how frequently the itemset appears in the
dataset in association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These

Q.no 40. When data are collected in a statistical study for only a portion or subset
of all elements of interest we are using

A : Sample

B : Parameter

C : Population

D : Probability

Q.no 41. In a data science project, the data acquisition step involves ----------------

A : Acquiring data from various sources.

B : Selecting dataset

C : Data preprocessing
D : Data modeling

Q.no 42. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 43. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability
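
The statement that the area under the bell curve represents probability can be checked numerically with the standard library. The sketch assumes a standard normal distribution (mean 0, standard deviation 1) purely for illustration.

```python
from statistics import NormalDist

standard = NormalDist(mu=0.0, sigma=1.0)

# The cumulative distribution function gives the area under the curve to the left of x.
below_mean = standard.cdf(0.0)
within_one_sigma = standard.cdf(1.0) - standard.cdf(-1.0)
within_two_sigma = standard.cdf(2.0) - standard.cdf(-2.0)

print(below_mean, within_one_sigma, within_two_sigma)
```

Half of the area lies below the mean, roughly 68% lies within one standard deviation of it, and roughly 95% within two.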

Q.no 44. The ------------ algorithm models a series of logical If-Then-Else decision
statements; there is no underlying assumption of a linear or non-linear
relationship between the input variables and response variables.

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 45. Which function returns an ndarray object that contains numbers that
are evenly spaced on a log scale?

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()

Q.no 46. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top to bottom

B : Bottom to top

C : Left to right

D : Right to left

Q.no 47. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 48. Which NumPy function is used to return the truncated value of the
input elementwise?

A : round()

B : trunc()

C : del()

D : remove_decimal()

Q.no 49. The strength (degree) of the correlation between a set of independent
variables X and a dependent variable Y is measured by-------------

A : Coefficient of Correlation

B : Coefficient of Determination

C : Standard error of estimate

D : Probability

Q.no 50. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 51. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 52. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : Lift

Q.no 53. The --------- function performs custom operations on the entire DataFrame.

A : function()

B : subroutine()

C : routine()

D : pipe()

Q.no 54. The --------- argument of the merge function, used while merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how

Q.no 55. Which of the following machine learning algorithms is used for market
basket analysis, that is, to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 56. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 57. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 58. Which of the following algorithms is used in economics, finance, biology,
etc., to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 59. While plotting using matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

Q.no 60. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Answer for Question No 1. is a

Answer for Question No 2. is a

Answer for Question No 3. is c

Answer for Question No 4. is a

Answer for Question No 5. is b

Answer for Question No 6. is d

Answer for Question No 7. is d

Answer for Question No 8. is a

Answer for Question No 9. is b

Answer for Question No 10. is d

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is a

Answer for Question No 15. is a

Answer for Question No 16. is d

Answer for Question No 17. is c

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is b

Answer for Question No 25. is d

Answer for Question No 26. is c

Answer for Question No 27. is b

Answer for Question No 28. is b

Answer for Question No 29. is b

Answer for Question No 30. is a

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is b

Answer for Question No 34. is a

Answer for Question No 35. is a

Answer for Question No 36. is c

Answer for Question No 37. is a

Answer for Question No 38. is d

Answer for Question No 39. is b

Answer for Question No 40. is a

Answer for Question No 41. is a

Answer for Question No 42. is b

Answer for Question No 43. is a

Answer for Question No 44. is b

Answer for Question No 45. is a

Answer for Question No 46. is a

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is a

Answer for Question No 50. is d

Answer for Question No 51. is a

Answer for Question No 52. is a

Answer for Question No 53. is d

Answer for Question No 54. is d

Answer for Question No 55. is b

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is a

Answer for Question No 59. is a

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability
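
With one dependent and one independent variable, the relationship is a straight line whose slope and intercept can be estimated by ordinary least squares. The sketch below is a minimal illustration; the function name and the data (generated from y = 2x + 1) are assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]    # independent variable (regressor)
ys = [3, 5, 7, 9, 11]   # dependent variable (regressand)

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With more than one independent variable, the same idea extends to multiple regression, which fits one coefficient per regressor.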

Q.no 2. Find odd one out from the following :

A : KNN

B : Naïve Bayes

C : Decision Trees

D : Cluster analysis

Q.no 3. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 4. ------------ plots show all individual data points without connecting them
with lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 5. Which of the following is NOT supervised learning?

A : PCA

B : Decision Tree

C : Linear Regression

D : Naive Bayesian

Q.no 6. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 7. In a NumPy array, indices always start from --------

A : 1

B : -1

C : 0

D : 2

Q.no 8. To import data from an Excel file into a DataFrame, the ---------- function is
provided by the pandas package.

A : read_csv()

B : read_file()

C : read()

D : read_excel()

Q.no 9. ---------- plot displays information as series of data points connected by

straight lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 10. Which of the following is not a raster image file format?

A : PNG

B : JPG

C : BMP

D : PDF

Q.no 11. Naïve Bayes is a classification technique based on ----------

A : Bayes Theorem

B : Pythagorean Theorem

C : Least square method

D : mean square method

Q.no 12. ---- is a technique to learn from examples and experience, without being
explicitly programmed.

A : Machine Learning

B : Software Testing
C : Computer Science

D : Data mining

Q.no 13. The -------- library is built on top of NumPy, SciPy and Matplotlib

A : Sympy

B : Scikit

C : Pandas

D : Numpy

Q.no 14. ------------- function is used to save an ndarray as an image file.

A : imsave()

B : imread()

C : save()

D : isave()

Q.no 15. For multidimensional visualization ---------------- are used.

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 16. The ---------------- library in Python provides efficient versions of a large
number of machine learning algorithms.

A : Pandas

B : Numpy

C : Scikit-Learn

D : image

Q.no 17. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 18. Which library from python is used for implementing machine learning
algorithms?

A : Scikit-Learn

B : Pandas

C : Matplotlib

D : Numpy

Q.no 19. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 20. ---------------- is about developing code to enable the machine to learn to
perform tasks; its basic principle is the automatic modeling of the underlying processes that
have generated the collected data.

A : Data Science

B : Data Analytics

C : Data Warehousing

D : Data mining

Q.no 21. -------- is the measure of the likelihood that an event will occur in a
random experiment

A : Probability

B : Correlation

C : Regression

D : Sample

Q.no 22. ----------- is a measure of the randomness in the information being
processed.

A : Entropy

B : Support

C : Confidence

D : Lift

Q.no 23. In the head()/tail() functions of a DataFrame, the default number of rows to
display is --------

A : 3

B : 5

C : 1

D : 10

Q.no 24. In SciPy ---------- submodule is dedicated to image processing.

A : ndimage

B : ndarray

C : signal

D : io

Q.no 25. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 26. The ---------- function is used to get the element-wise remainder of division of arrays.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)
Q.no 27. Which of the following plots is not used for multidimensional
visualization?

A : Andrews Curves

B : Parallel Chart

C : Deviation Chart

D : Bar

Q.no 28. --------------- searches for the linear optimal separating hyperplane for
separation of the data using essential training tuples called support vectors

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 29. From matplotlib, the ------------------ module is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 30. In ------------ the x-axes are grouped into bins and each bin will be treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 31. If X and Y are independent of each other, then the correlation
coefficient is ---------

A : 1

B : -1

C : 0

D : 2

Q.no 32. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Conﬁdence

B : Support

C : Lift

D : None of These

Q.no 33. In which of the following types of clustering algorithm is the notion of
similarity derived from the closeness of a data point to the centroid of the
clusters?

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 34. The last element of ndarray is indexed by -------------

A : 0

B : -1

C : 1

D : -2

Q.no 35. ------- changes the arrangement of items in an array so that the shape of the
array changes while maintaining the same number of dimensions.

A : numpy.reshape()

B : numpy.empty()

C : numpy.flatten()

D : numpy.ravel()

Q.no 36. Identify the machine generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 37. ------------- is unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 38. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))

Q.no 39. ------------ is an example of semi structured data

A : XML data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 40. In decision trees leaf node denotes a -----------------

A : class distribution

B : test on an attribute

C : outcome of the test

D : class labels

Q.no 41. Which of the following algorithms is used in Economics, Finance, Biology,
etc. to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 42. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 43. In regression the independent variable is also called as ------------

A : Regressor

B : Continuous

C : Regressand

D : Estimated

Q.no 44. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 45. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These
Q.no 46. In unsupervised learning, scikit learn uses ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 47. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

Q.no 48. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 49. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 50. In a data science project, the data acquisition step involves ----------------

A : Acquiring data from various sources.

B : Selecting dataset
C : Data preprocessing

D : Data modeling

Q.no 51. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 52. Which function returns an ndarray object that contains the numbers that
are evenly spaced on a log scale.

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()

Q.no 53. In matplotlib, -------------- is the container class for the Figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 54. The ---------- machine learning algorithm is used in cross marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 55. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 56. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : Lift

Q.no 57. Determining the basic salary of an employee when his or her qualification is given is
a ----------- problem

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 58. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3 x 4, 5

C : 3 x 5, 4

D : 5 x 3, 4

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes
Q.no 60. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()
Answer for Question No 1. is a

Answer for Question No 2. is d

Answer for Question No 3. is d

Answer for Question No 4. is c

Answer for Question No 5. is a

Answer for Question No 6. is a

Answer for Question No 7. is c

Answer for Question No 8. is d

Answer for Question No 9. is b

Answer for Question No 10. is d

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is a

Answer for Question No 15. is c

Answer for Question No 16. is c

Answer for Question No 17. is c

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is b

Answer for Question No 21. is a

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is a

Answer for Question No 25. is d

Answer for Question No 26. is b

Answer for Question No 27. is d

Answer for Question No 28. is d

Answer for Question No 29. is b

Answer for Question No 30. is d

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is b

Answer for Question No 34. is b

Answer for Question No 35. is a

Answer for Question No 36. is d

Answer for Question No 37. is d

Answer for Question No 38. is a

Answer for Question No 39. is a

Answer for Question No 40. is d

Answer for Question No 41. is a

Answer for Question No 42. is c

Answer for Question No 43. is a

Answer for Question No 44. is d

Answer for Question No 45. is a

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is a

Answer for Question No 49. is d

Answer for Question No 50. is a

Answer for Question No 51. is c

Answer for Question No 52. is a

Answer for Question No 53. is d

Answer for Question No 54. is b

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is a

Answer for Question No 59. is b

Answer for Question No 60. is c
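
The support formula from Q.no 38 can be checked by hand on a small example. The following is a minimal Python sketch, assuming a made-up list of four transactions; the names `transactions`, `support` and `confidence` are illustrative and do not come from any library. Confidence is included because it is defined in terms of support.

```python
# Hypothetical market-basket transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Support = (transactions containing the itemset) / (total transactions)."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of antecedent -> consequent: support of both / support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


if __name__ == "__main__":
    # "bread" appears in 3 of the 4 transactions.
    print("support(bread) =", support({"bread"}, transactions))
    # "bread" and "milk" appear together in 2 of the 3 transactions with "bread".
    print("confidence(bread -> milk) =", confidence({"bread"}, {"milk"}, transactions))
```

Here "bread" occurs in three of four transactions, so its support is 3/4, which matches option A of Q.no 38.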

---------------------------------------------------------------------------------------------------------------------
SET 1 MCQs
---------------------------------------------------------------------------------------------------------------------
MCQ No - 1
According to analysts, for what can traditional IT systems provide a foundation when
they’re integrated with big data technologies like Hadoop?
(A) Big data management and data mining
(B) Data warehousing and business intelligence
(C) Management of Hadoop clusters
(D) Collecting and storing unstructured data
Answer
A

MCQ No - 2
What are the main components of Big Data?
(A) MapReduce
(B) HDFS
(C) YARN
(D) All of these
Answer
D

MCQ No - 3
What are the different features of Big Data Analytics?
(A) Open-Source
(B) Scalability
(C) Data Recovery
(D) All the above

Answer
D

MCQ No - 4
According to analysts, for what can traditional IT systems provide a foundation when
they’re integrated with big data technologies like Hadoop?
(A) Big data management and data mining
(B) Data warehousing and business intelligence
(C) Management of Hadoop clusters
(D) Collecting and storing unstructured data

Answer
A

MCQ No - 5
What are the four V’s of Big Data?
(A) Volume
(B) Velocity

OptimusPrime Page 1
(C) Variety
(D) All the above

Answer
D

MCQ No - 6
All of the following accurately describe Hadoop, EXCEPT:

(A) Open-source
(B) Real-time
(C) Java-based
(D) Distributed computing approach

Answer
B

MCQ No - 7
___________ is general-purpose computing model and runtime system for distributed data
analytics.
(A) Mapreduce
(B) Drill
(C) Oozie
(D) None of the above

Answer
A

MCQ No - 8
The examination of large amounts of data to see what patterns or other useful information
can be found is known as
(A) Data examination
(B) Information analysis
(C) Big data analytics
(D) Data analysis

Answer
C

MCQ No - 9
Big data analysis does the following except
(A) Collects data
(B) Spreads data
(C) Organizes data
(D) Analyzes data

Answer
B

MCQ No - 10
What makes Big Data analysis difficult to optimize?
(A) Big Data is not difficult to optimize
(B) Both data and cost effective ways to mine data to make business sense out of it
(C) The technology to mine data
(D) All of the above

Answer
B

MCQ No - 11
The new source of big data that will trigger a Big Data revolution in the years to come is
(A) Business transactions
(B) Social media
(C) Transactional data and sensor data
(D) RDBMS

Answer
C

MCQ No - 12
The unit of data that flows through a Flume agent is
(A) Log
(B) Row
(C) Event
(D) Record

Answer
C

MCQ No - 13
Listed below are the three steps that are followed to deploy a Big Data Solution except
(A) Data Ingestion
(B) Data Processing
(C) Data dissemination
(D) Data Storage

Answer
C

MCQ No - 14
Choose the best answer: which industries employ so-called "Big Data"
in their day-to-day operations?
(A) Weather forecasting
(B) Marketing
(C) Healthcare
(D) All of the above

Answer
D

MCQ No - 15
There are almost as many bits of information in the digital universe as there are stars in
the actual universe.
(A) True
(B) False

Answer
A

MCQ No - 16
The word 'Big data' was coined by
(A) Roger Mougalas
(B) John Philips
(C) Simon Woods
(D) Martin Green

Answer
A

MCQ No - 17
The word 'Big Data' was coined in the year
(A) 2000
(B) 1970
(C) 1998
(D) 2005

Answer
C

MCQ No - 18
Concerning the Forms of Big Data, which one of these is odd?
(A) Structured
(B) Unstructured
(C) Processed
(D) Semi-Structured

Answer
C

MCQ No - 19
Big Data applications benefit the media and entertainment industry by
(A) Predicting what the audience wants
(B) Ad targeting
(C) Scheduling optimization
(D) All of the above

Answer
D

MCQ No - 20
The feature of big data that refers to the quality of the stored data is ______
(A) Variety
(B) Volume
(C) Variability
(D) Veracity

Answer
D

Question 1

What is the difference between interval/ratio and ordinal variables?

a) The distance between categories is equal across the range of interval/ratio data.

Question 2

What is the difference between a bar chart and a histogram?

c) There are no gaps between the bars on a histogram.

Question 3

What does the term 'outlier' mean?

d) An extreme value at either end of a distribution

Question 4

What is the function of a contingency table, in the context of bivariate analysis?

Correct answer:
b) It summarizes the frequencies of two variables so that they can be compared.

Question 5

If there were a perfect positive correlation between two interval/ratio variables, the
Pearson's r test would give a correlation coefficient of:
Correct answer:
b) +1
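
The +1 result can be reproduced numerically. The following is a minimal Python sketch of Pearson's r written from the textbook definition (covariance divided by the product of the standard deviations); the function name `pearson_r` and the sample data are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
ys_up = [2 * x + 1 for x in xs]     # rises exactly linearly with x
ys_down = [10 - 3 * x for x in xs]  # falls exactly linearly with x

if __name__ == "__main__":
    print("perfect positive:", pearson_r(xs, ys_up))
    print("perfect negative:", pearson_r(xs, ys_down))
```

Any exactly linear increasing relationship gives a coefficient of +1 and any exactly linear decreasing relationship gives -1, regardless of the slope.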

Question 6

What is the name of the test that is used to assess the relationship between two ordinal variables?

Correct answer:
a) Spearman's rho

Question 7

When might it be appropriate to conduct a multivariate analysis test?

Correct answer:
d) All of the above.

Question 8

What is meant by a "spurious" relationship between two variables?

Correct answer:
c) A relationship that appears to be true because each variable is related to a third one.

Question 9

A test of statistical significance indicates how confident the researcher is about:

Correct answer:
d) generalising their findings from the sample to the population.

Question 10

Setting the p level at 0.01 increases the chances of making a:

Correct answer:
b) Type II error

---------------------------------------------------------------------------------------------------------------------
SET 2 MCQs
---------------------------------------------------------------------------------------------------------------------

1. Data Analysis is a process of?

A. inspecting data
B. cleaning data
C. transforming data
D. All of the above
Ans : D

2. Which of the following is not a major data analysis approach?
A. Data Mining
B. Predictive Intelligence
C. Business Intelligence
D. Text Analytics
Ans : B

3. How many main statistical methodologies are used in data analysis?

A. 2
B. 3
C. 4
D. 5
Ans : A

4. In descriptive statistics, data from the entire population or a sample is summarized with ?
A. integer descriptors
B. floating descriptors
C. numerical descriptors
D. decimal descriptors
Ans : C

5. Data Analysis was defined by which statistician?

A. William S.
B. Hans Peter Luhn
C. Gregory Piatetsky-Shapiro
D. John Tukey
Ans : D

6. Which of the following is true about hypothesis testing?

7. The goal of business intelligence is to allow easy interpretation of large volumes of data to
identify new opportunities.

A. TRUE
B. FALSE
C. Can be true or false
D. Can not say
Ans : A

8. The branch of statistics which deals with development of particular statistical methods is
classified as
A. industry statistics
B. economic statistics
C. applied statistics
D. applied statistics
Ans : D

9. Which of the following is true about regression analysis?

10. Text Analytics is also referred to as Text Mining.

A. TRUE
B. FALSE
C. Can be true or false
D. Can not say
Ans : A

---------------------------------------------------------------------------------------------------------------------
SET 3 MCQs
---------------------------------------------------------------------------------------------------------------------

Which type of test is the Wilcoxon rank sum test?

Answer
non-Parametric

Input data for Wilcoxon test is normally distributed, True or False?

Answer
False

What is the null hypothesis for a Wilcoxon test

Answer
Two group means are equal.

Which of the following test statistics is used in the Wilcoxon Rank Sum Test?

Answer
If the test statistic <= critical value, H0 will be rejected

What must you include when applying Wilcoxon Rank sum test?
Answer
“Critical Value”, “Rank sum”

Type 1 error is also called as

Answer
False Positive

Type 2 error is also called as

Answer
False negative

Type 1 error occurs when_____

Answer
Null hypothesis rejected when it is true.

Type 2 error occurs when_____

Answer
Null hypothesis is accepted when it is false

How to reduce Type 2 error?

Answer

By increasing sample size

Analysis of Variance is statistical method of comparing____of several populations
Answer
Means

ANOVA is used when____

Answer
There are more than two populations

What is Null Hypothesis in ANOVA?

Answer
all group means are equal

What does ANOVA calculate?

Answer
F ratio

What are the two types of variance which can occur in your data?
Answer
Between and within groups

If the between-group mean sum of squares increases, the value of the F statistic _____

Answer
Increases
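
The effect of between-group variability on F can be seen in a small calculation. The following is a minimal Python sketch of the one-way ANOVA F ratio (between-group mean square divided by within-group mean square); the function name `f_ratio` and both data sets are invented for illustration, and no significance test is performed.

```python
def f_ratio(groups):
    """One-way ANOVA F ratio: between-group mean square / within-group mean square."""
    k = len(groups)                                  # number of groups
    n_total = sum(len(g) for g in groups)            # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)        # degrees of freedom: k - 1
    ms_within = ss_within / (n_total - k)    # degrees of freedom: N - k
    return ms_between / ms_within


# Both data sets have the same spread inside each group; only the
# distance between the group means differs.
close_groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
far_groups = [[1, 2, 3], [5, 6, 7], [9, 10, 11]]

if __name__ == "__main__":
    print("F (means close together):", f_ratio(close_groups))
    print("F (means far apart):     ", f_ratio(far_groups))
```

Because the within-group spread is identical in both data sets, the larger F for the second one comes entirely from the larger between-group mean square.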

What must you include when applying ANOVA test?

Answer
“Means”, “Critical Value”, “degree of freedom”, “F statistics”

How many dependent variables are there in a two-way ANOVA?

Answer
1 (a two-way ANOVA has two independent variables, or factors, but a single dependent variable)

Which of the following test statistics is used in ANOVA?

Answer
If critical value < F ratio, H0 will be rejected

Various types of ANOVA are___

Answer
“Two way ANOVA”, “ANCOVA”, “MANOVA”

Clustering is the classification of objects into different groups, True or False?

Answer
True

Clustering partitions the data into k subsets, True or False?

Answer
True

Clustering extracts the known patterns from the existing data, True or False?
Answer
False

Clustering techniques fall in the category of________

Answer
Unsupervised learning

K Means falls in the category of

Answer
Partitional Clustering

Agglomerative clustering is an ____ approach

Answer
bottom-up

Which of the following is determined by Distance Measure?

Answer
Similarity of two elements

K Means is _____
Answer
Centroid based method

Clustering is often used as a lead-in to classification, True or False?

Answer
True

K Means algorithm is iterative in nature, True or False?

Answer
True

In K Means algorithm, K is_____

Answer
Number of centers

WSS metric is the sum of the squares of the distances between each data point and the_____.
Answer
closest centroid
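
The WSS definition above translates directly into code. The following is a minimal Python sketch, assuming points and centroids are given as coordinate tuples; the names `squared_distance` and `wss` and the sample data are illustrative only, and the centroids are supplied by hand rather than found by k-means.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wss(points, centroids):
    """Within sum of squares: each point contributes its squared distance
    to the closest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Two visibly separate groups of points (made-up data).
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]

one_centroid = [(4.75, 4.75)]               # mean of all four points
two_centroids = [(1.0, 1.5), (8.5, 8.0)]    # mean of each group

if __name__ == "__main__":
    print("WSS with k=1:", wss(points, one_centroid))
    print("WSS with k=2:", wss(points, two_centroids))
```

WSS drops sharply once each natural group has its own centroid, which is why it is a common yardstick when comparing different values of k.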

To use k-means properly, it is important to

Answer
All of the above

Once the clusters are identified, it is often useful to label them in a descriptive way. True or
False?
Answer
True

K-means is easily applied to ………………

Answer
Numeric attributes

Identification of highly correlated attributes is important for __.

Answer
reduction

K-means is sensitive to starting seeds, True or False?

Answer
True

The process of identifying the appropriate value of k is referred to as finding the_____.
Answer
elbow

A _____ is a decision support tool that uses a tree-like graph or model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility
Answer
Decision tree

Decision Trees can be used for Classification Tasks.

Answer
True

Tree/Rule based classification algorithms generate … rule to perform the classification.

Answer
if-then

What is Decision Tree?

Answer
A flowchart-like structure in which each internal node represents a test on an attribute, each branch
represents an outcome of the test, and each leaf node represents a class label

What is the approach of basic algorithm for decision tree induction?

Answer
Greedy

High entropy means that the partitions in classification are

Answer
not pure

What is information gain of attribute?

Answer
It is a measure of purity
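
The link between entropy, purity and information gain can be made concrete. The following is a minimal Python sketch using the standard definitions (entropy as the negative sum of p * log2(p) over classes, information gain as parent entropy minus the size-weighted entropy of the partitions); the function names and label lists are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, partitions):
    """Parent entropy minus the size-weighted entropy of the partitions."""
    n = len(parent)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - weighted


mixed = ["yes", "yes", "no", "no"]    # evenly mixed: least pure for two classes
pure = ["yes", "yes", "yes", "yes"]   # single class: completely pure

if __name__ == "__main__":
    print("entropy(mixed) =", entropy(mixed))
    print("entropy(pure)  =", abs(entropy(pure)))
    # Splitting the mixed set into two pure halves removes all uncertainty.
    print("gain of a perfect split =",
          information_gain(mixed, [["yes", "yes"], ["no", "no"]]))
```

An attribute whose split produces purer partitions yields a higher information gain, which is the sense in which gain measures purity.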

Gini index is used by________.

Answer
CART

How will you counter over-fitting in decision tree?

Answer
By pruning the longer rules

Calculate entropy of following equation?

Answer
0.9836

What is true about Data Visualization?
Answer
All of the above

Which one of the following is the most basic and commonly used technique?
Answer
“Line charts”

Common use cases for data visualization include?

Answer
“All of the above”

Which of the following are Data Visualization tools?

Answer
“All of above”

Which of the following is false?

Answer
“Data visualization decreases insights and leads to slower decisions”

Which among the following are the features of the Hadoop?

Answer
All of Above

Hadoop Framework is written in_____.

Answer
Java

Which of the following is component of Hadoop?

Answer
All of Above

As compared to RDBMS, Hadoop_____

Answer
works better on unstructured and semistructured data.

Which of the following is the daemon of Hadoop?

Answer
All of Above

Which types of data can Hadoop deal with?

Answer
All of Above

When a client contacts the namenode for accessing a file, the namenode responds with____
Answer
Block Id and hostname of all the data nodes containing that block.

Which of the following is used for machine learning on Hadoop?

Answer
Mahout

Zookeeper ensures that_________.

Answer
Only one namenode is actively serving client requests

The tables created in hive are stored as____.

Answer
a subdirectory under the database directory

---------------------------------------------------------------------------------------------------------------------
SET 4 MCQs
---------------------------------------------------------------------------------------------------------------------

1) What is Big Data?

a) Huge amount of data
b) Small amount of data
c) Huge File
d) Big Storage
Ans: a Explanation: It is Huge amount of data

2)
According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
Ans: a Explanation: Big data management and data mining

3)
What are the main components of Big Data?
a)MapReduce
b)HDFS
c)YARN
d)All of these
Ans: d Explanation: All of these

4)
The sources of Big Data are
a)Stock Exchange
b)Transport Data
c) Banking Data
d) All of the Above
Ans: d Explanation:

5)
Big Data Characteristics are:
a) Structured data
b) Semi-structured data
c) Quasi-structured data
d) All of the above
Ans: d Explanation:

6)
BI tends to provide reports, dashboards, and queries on business questions for the current period
or in the past.
a) True
b) False
Ans: a Explanation:

7)
Big data can come in multiple forms, including structured and non-structured data
a) True
b) False
Ans: a Explanation:

8)
BI problems tend to require highly structured data organized
a) Rows
b) Columns
c) Accurate Reporting
d) All of the Above
Ans: d Explanation:

9)
EDW achieves the objective of reporting and sometimes the creation of dashboards, perform
analysis on unstructured data
a) High-value data is hard to reach and leverage
b) Data moves in batches from EDW to local analytical tools
c) Data Science projects will remain isolated
d) All of the Above
Ans: d Explanation:

10)
Drivers of Big Data
a) Medical information
b) Photos and video footage uploaded to the World Wide Web
c) data extracts
d) Both a and b
Ans: d Explanation:

11)
According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
Ans: a Explanation:

12)
Select from option which is not the phase of data analytics
a) model planning
b) testing
c) discovery
d) operationalize
Ans: b Explanation:

13)
Which phase of data analytics require more time to complete
a) Data preparation
b) model building
c) communicate results
d) Discovery
Ans: a Explanation:

14)
What is analytic sandbox?
a) Tool
b) Separate repository
c) data cleaning
d) Data conditioning
Ans: b Explanation:

15)
The person who provides analytic techniques and modeling is called a:
a) Data Engineer
b) Data scientist
c) Business user
d) Project manager
Ans: b Explanation:

16)
What is task of Project manager?
a) analytic modelling
b) Provide requirement
c) ensure meeting objectives
d) creates DB environment
Ans: c

17)
In which phase is the task of identifying key stakeholders performed?
a) Data preparation
b) model building
c) Discovery
d) communicate results
Ans: c Explanation:

18)
ETL process is performed in which phase
a) Discovery
b) communicate results
c) model planning
d) Data preparation
Ans: d Explanation:

19)
How much data do data science teams prefer for analysis?
a) too little
b) average
c) more
d) more than average
Ans: c Explanation:

20)
Select the tool which is not used in the model planning phase.
a) Data wrangler
b) R
c) SQL Analysis service
d) SAS/ACCESS
Ans: c Explanation:

21)
If reports and dashboards will be impacted and need to change, this task is performed by the:
a) Project sponsor
b) BI Analyst
c) Data Engineer
d) Project manager
Ans: b Explanation:

22)
What is the need for the data analytic lifecycle?
a) Data cleaning
b) To solve Big data problems
c) Data conditioning
d) Data Exploration
Ans: b Explanation:

23)
How many phases are there in data analytic lifecycle?
a) 4
b) 5
c) 6
d) 7
Ans: c

24)
The person with technical skills is called as?
a) Business user
b) Data Engineer
c) Data scientist
d) Project sponsor
Ans: b

25)
What is outcome of Model building phase?
a) Analytic results
b) Quality data
c) Data
d) Potential resources
Ans: a

1)
A statement made about a population for testing purpose is called?

a) Statistic
b) Hypothesis
c) Level of Significance
d) Test-Statistic
Ans: b Explanation:

2)
If the assumed hypothesis is tested for rejection considering it to be true is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
Ans: a Explanation:

3)
A statement whose validity is tested on the basis of a sample is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
Ans: b Explanation:

4)
A hypothesis which defines the population distribution is called?
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
Ans: c Explanation:

5)
If the null hypothesis is false then which of the following is accepted?
a) Null Hypothesis
b) Positive Hypothesis
c) Negative Hypothesis
d) Alternative Hypothesis.
Ans: d Explanation:

6)
The rejection probability of Null Hypothesis when it is true is called as?
a) Level of Confidence
b) Level of Significance
c) Level of Margin
d) Level of Rejection
Ans: b Explanation:

7)
The point where the Null Hypothesis gets rejected is called as?
a) Significant Value
b) Rejection Value
c) Acceptance Value
d) Critical Value
Ans: d Explanation:

8)
If the Critical region is evenly distributed then the test is referred as?
a) Two tailed
b) One tailed
c) Three tailed
d) Zero tailed
Ans: a Explanation:

9)
The type of test is defined by which of the following?
a) Null Hypothesis
b) Simple Hypothesis
c) Alternative Hypothesis
d) Composite Hypothesis
Ans: c Explanation:

10)
Which of the following is defined as the rule or formula to test a Null Hypothesis?
a) Test statistic
b) Population statistic
c) Variance statistic
d) Null statistic
Ans: a Explanation:

11)
Type 1 error occurs when?
a) We reject H0 if it is True
b) We reject H0 if it is False
c) We accept H0 if it is True
d) We accept H0 if it is False
Ans: a Explanation:

12)
The probability of Type 1 error is referred as?
a) 1-α
b) β
c) α
d) 1-β
Ans: c Explanation:

13)
Alternative Hypothesis is also called as?
a) Composite hypothesis
b) Research Hypothesis
c) Simple Hypothesis
d) Null Hypothesis
Ans: b Explanation:

14)
Which of the following is required by K-means clustering?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned
Ans: d Explanation:

15)
Point out the wrong statement.
a) k-means clustering is a method of vector quantization
b) k-means clustering aims to partition n observations into k clusters
c) k-nearest neighbor is same as k-means
d) none of the mentioned
Ans: c Explanation:

16)
Hierarchical clustering should be primarily used for exploration.
a) True
b) False
Ans: a Explanation:

17)
Which of the following function is used for k-means clustering?
a) k-means
b) k-mean
c) heatmap
d) none of the mentioned
Ans: a Explanation:

18)
Which of the following clustering requires merging approach?
a) Partitional
b) Hierarchical
c) Naive Bayes
d) None of the mentioned
Ans: b Explanation:

19)
K-means is not deterministic and it also consists of number of iterations.
a) True
b) False
Ans: a

20) Depending on acceptance and rejection of the null hypothesis, 2 types of error can be produced:
a) Type 1
b) Type 2
c) None of these
d) All of these
Ans: d

21) The power of a test can be defined as the probability of …
a) Rejecting null hypothesis
b) Accepting null hypothesis
c) Increasing null hypothesis
d) Decreasing null hypothesis
Ans: a

22) For a fixed significance level, a greater sample size is required to detect a
a) Minor difference in mean
b) Major difference in mean
c) Average difference in mean
d) None of the above
Ans: a
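
Questions 21 and 22 can be tied together with a small calculation. The sketch below computes the power of a right-tailed one-sample z-test, assuming a known standard deviation; the function name `power_one_sided_z` and the example numbers are assumptions chosen for illustration, not values from this question bank.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(delta, sigma, n, alpha=0.05):
    """Probability of rejecting H0 when the true mean exceeds it by `delta`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)  # critical value of the right-tailed test
    # Under the alternative, the test statistic is shifted by delta*sqrt(n)/sigma.
    return 1 - z.cdf(z_crit - delta * sqrt(n) / sigma)


# Same small difference in means, two different sample sizes.
small_n = power_one_sided_z(delta=0.2, sigma=1.0, n=25)
large_n = power_one_sided_z(delta=0.2, sigma=1.0, n=400)
print(round(small_n, 2), round(large_n, 2))
```

With the significance level held fixed, the small difference is detected far more reliably at the larger sample size, which is the point of question 22.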

23) ANOVA tests if any of the population means vary from other population means
a) True
b) False
Ans: a

24) Clustering is defined as a grouping of objects of the same kind, gathered by use of a(n)
a) Unsupervised method
b) Supervised method
c) Semi supervised method
d) None of these
Ans: a

25) Which of the following are applications of K-means?
a) Image Processing
b) Medical
c) Customer Segmentation
d) All of the above
Ans: d

---------------------------------------------------------------------------------------------------------------------
SET 5 MCQs
---------------------------------------------------------------------------------------------------------------------

1. Data in ___________ bytes size is called Big Data.

A. Tera
B. Giga
C. Peta
D. Meta
Ans : C
Explanation: Data of petabyte size (10^15 bytes) is called Big Data.

2. How many V's of Big Data are there?

A. 2
B. 3
C. 4
D. 5
Ans : D
Explanation: Big Data was originally defined by the “3Vs”, but now there are “5Vs” of Big Data: Volume, Velocity, Variety, Veracity, and Value.

3. Transaction data of the bank is?

A. structured data
B. unstructured data
C. Both A and B
D. None of the above
Ans : A
Explanation: Data that can be saved in tables, such as the transaction data of a bank, is structured data.

4. In how many forms can Big Data be found?

A. 2
B. 3
C. 4
D. 5
Ans : B
Explanation: Big Data can be found in three forms: structured, unstructured, and semi-structured.

5. Which of the following are Benefits of Big Data Processing?

A. Businesses can utilize outside intelligence while taking decisions
B. Improved customer service
C. Better operational efficiency
D. All of the above
Ans : D
Explanation: All of the above are Benefits of Big Data Processing.

6. Which of the following is not a Big Data technology?

A. Apache Hadoop
B. Apache Spark
C. Apache Kafka
D. Apache Pytarch
Ans : D
Explanation: Apache Pytarch is not a Big Data technology.

7. What percentage of the world’s total data has been created within just the past two years?
A. 80%
B. 85%
C. 90%
D. 95%
Ans : C
Explanation: 90% of the world’s total data has been created within just the past two years.

8) Which of the following steps is performed by a data scientist after acquiring the data?
a) Data Cleansing
b) Data Integration
c) Data Replication
d) All of the mentioned
Ans: Data Cleansing

9) 3V’s are not sufficient to describe big data.

a) True
b) False
Ans: True

10. Being communicative and collaborative is one of the key skill sets and behavioral characteristics of a data scientist [True / False]?
a. True
b. False
Answer : a

11. __________ are the sources of Big Data [select all that apply]
I. Book
II. Facebook
III. Genome sequence
IV. Video Surveillance
Ans:

12. BI analyses past data and makes future predictions. True/False?
a. True
b. False
Answer : b

12. In which phase of data analytics is ETLT performed?
A. Discovery
B. Model Planning
C. Model Building
D. Data Preparation
Ans: Phase 2. Data preparation is done in this phase. An analytic sandbox is used to perform analytics for the entire duration of the project. While you explore, preprocess, and condition data, modeling follows suit. To get the data into the sandbox, you perform ETLT (extract, transform, load, and transform).

13. In which data analytics lifecycle phase is an analytic sandbox prepared?
A. Data Preparation
B. Model Planning
C. Model Building
D. Discovery
Ans: Phase 2 (Data Preparation). Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox.

14. In which phase would the team expect to invest most of the project time?
A. Data Preparation
B. Model Planning
C. Model Building
D. Discovery

15. In which phase would the team expect to invest the least project time?
A. Data Preparation
B. Model Planning
C. Model Building
D. Discovery

16. From the following tools, which tool is used for model building?
a. Hadoop b. Octave c. OpenRefine d. All of the above
Ans: b

17. From the following tools, which tool is used for data preparation?
a. Alpine Miner b. Excel c. Matlab d. Weka
Ans: a

18. Determining whether the project was completed on time and within budget is the key role of _____
A. Project Sponsor
B. Project Manager
C. Data Engineer
D. Data Scientist

19. How many Phases are there in Data Analytics Lifecycle?

A. 3
B. 6
C. 7
D. Any

20. In the Data Analytics Lifecycle, we can move back and refine the work already done. True or False?
A. True
B. False

21. What are the key outputs from Analytics Projects?

A. PPT
B. Report
C. Code
D. All of the above

22. ________ provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems.
A. Project Sponsor
B. Project Manager
C. Data Engineer
D. Data Scientist

1. A statement about a population developed for the purpose of testing is called:

(a) Hypothesis
(b) Hypothesis testing
(c) Level of significance
(d) Test-statistic
Answer : a

2. Any hypothesis which is tested for the purpose of rejection under the assumption that it is true is called:
(a) Null hypothesis
(b) Alternative hypothesis
(c) Statistical hypothesis
(d) Composite hypothesis
Answer : a

3. A statement that is accepted if the sample data provide sufficient evidence that the null hypothesis is false is called:
(a) Simple hypothesis
(b) Composite hypothesis
(c) Statistical hypothesis
(d) Alternative hypothesis
Answer : d

4. The alternative hypothesis is also called:

(a) Null hypothesis
(b) Statistical hypothesis
(c) Research hypothesis
(d) Simple hypothesis
Answer : c

5. The probability of rejecting the null hypothesis when it is true is called:

(a) Level of confidence
(b) Level of significance
(c) Power of the test
(d) Difficult to tell
Answer : b

6. If the critical region is located equally on both sides of the sampling distribution of the test statistic, the test is called:
(a) One tailed
(b) Two tailed
(c) Right tailed
(d) Left tailed
Answer : b

7. The choice of one-tailed test and two-tailed test depends upon:

(a) Null hypothesis
(b) Alternative hypothesis
(c) None of these
(d) Composite hypotheses
Answer : b

8. Test of hypothesis H0: μ = 50 against H1: μ > 50 leads to:

(a) Left-tailed test
(b) Right-tailed test
(c) Two-tailed test
(d) Difficult to tell
Answer : b

9. Testing H0: μ = 25 against H1: μ ≠ 25 leads to:

(a) Two-tailed test
(b) Left-tailed test
(c) Right-tailed test
(d) None of (a), (b), or (c)
Answer : a

10. A formula that provides a basis for testing a null hypothesis is called:
(a) Test-statistic
(b) Population statistic
(c) Both of these
(d) None of the above
Answer : a

11. 1 – α is also called:

(a) Confidence coefficient
(b) Power of the test
(c) Size of the test
(d) Level of significance
Answer : a

12. Area of the rejection region depends on:

(a) Size of α
(b) Size of β
(c) Test-statistic
(d) Number of values
Answer : a
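
Questions 6 to 12 above link the alternative hypothesis, the tails of the test, and the size of the rejection region. A minimal sketch, assuming a z-test under the standard normal distribution; the function name `critical_values` and its `tail` argument are hypothetical names introduced for this example.

```python
from statistics import NormalDist


def critical_values(alpha, tail="two"):
    """Return the z critical value(s) bounding a rejection region of total area alpha."""
    z = NormalDist()
    if tail == "two":      # H1: mu != mu0, alpha is split equally between both tails
        c = z.inv_cdf(1 - alpha / 2)
        return (-c, c)
    if tail == "right":    # H1: mu > mu0, the whole of alpha sits in the right tail
        return (z.inv_cdf(1 - alpha),)
    if tail == "left":     # H1: mu < mu0, the whole of alpha sits in the left tail
        return (z.inv_cdf(alpha),)
    raise ValueError("tail must be 'two', 'right', or 'left'")


two_tailed = critical_values(0.05, "two")
right_tailed = critical_values(0.05, "right")
print([round(c, 2) for c in two_tailed], round(right_tailed[0], 2))
```

Changing `alpha` changes the area of the rejection region directly, while the choice of `tail` only decides where that area is placed.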

13. Student’s t-test is applicable only when:

(a) n≤30 and σ is known
(b) n>30 and σ is unknown
(c) n=30 and σ is known
(d) All of the above
Answer : a

14. In an unpaired samples t-test with sample sizes n1 = 11 and n2 = 11, the value of tabulated t should be obtained for:
(a) 10 degrees of freedom
(b) 21 degrees of freedom
(c) 22 degrees of freedom
(d) 20 degrees of freedom
Answer : d
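
The degrees of freedom in question 14 follow from df = n1 + n2 - 2. The sketch below computes a pooled-variance two-sample t statistic together with its degrees of freedom; it assumes equal population variances, and the function name `pooled_t` and the sample values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(sample1, sample2):
    """Pooled-variance (equal variances assumed) two-sample t statistic and its df."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    t = (mean(sample1) - mean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


# Hypothetical samples with 11 observations each, as in the question.
a = list(range(1, 12))
b = [x + 1 for x in a]
t_stat, df = pooled_t(a, b)
print(df)
```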

15. The purpose of statistical inference is:
(a) To collect sample data and use them to formulate hypotheses about a population
(b) To draw conclusions about populations and then collect sample data to support the conclusions
(c) To draw conclusions about populations from sample data
(d) To draw conclusions about the known value of population parameter
Answer : c

16. The histogram to the right represents the hospital length of stay (in days) for patients at a nearby medical facility. How many patients are included in the histogram?
a. 5
b. 21
c. 17
d. 9
Answer : b

17. Using the histogram to the right that represents the hospital lengths of stay (in days) for patients at a nearby medical facility, determine the relationship between the mean and the median.
a. Mean = Median
b. Mean ≈ Median
c. Mean < Median
d. Mean > Median
Answer : d

18. The statement “If there is sufficient evidence to reject a null hypothesis at the 10% significance level, then there is sufficient evidence to reject it at the 5% significance level” is:
Please select the best answer of those provided below.
a. Always True
b. Never True
c. Sometimes True; the p-value for the statistical test needs to be provided for a conclusion
d. Not Enough Information; this would depend on the type of statistical test used
Answer : c
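
Why the statement in question 18 is only sometimes true can be checked with the usual p-value decision rule. A minimal sketch; the helper name `reject_null` and the two p-values are illustrative assumptions.

```python
def reject_null(p_value, alpha):
    """Standard decision rule: reject H0 when the p-value does not exceed alpha."""
    return p_value <= alpha


for p in (0.03, 0.07):
    print(p, reject_null(p, 0.10), reject_null(p, 0.05))
```

A p-value of 0.03 leads to rejection at both levels, while a p-value of 0.07 leads to rejection at the 10% level only, so the p-value is needed before a conclusion can be drawn.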

19. The short form of analysis of variance is?

a) ANOV
b) AVA
c) ANOVA
d) ANVA
Ans:c

20) Which of the following is required by K-means clustering?

a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the mentioned
Ans: defined distance metric, number of clusters, initial guess as to cluster centroids

21) Hierarchical clustering should be primarily used for exploration.

a) True
b) False
Ans: True

22) Which of the following functions is used for k-means clustering?

a) k-means
b) k-mean
c) heatmap
d) none of the mentioned
Ans: k-means

23) The goal of clustering a set of data is to

24) The k-means algorithm...

26) What are the two types of Hierarchical Clustering?

a) Top-Down Clustering (Divisive)
b) Bottom-Up Clustering (Agglomerative)
c) Dendrogram
d) K-means
Ans: Top-Down Clustering (Divisive), Bottom-Up Clustering (Agglomerative)

27) The most commonly used measure of similarity is the _____ or its square.
a) Euclidean distance
b) city-block distance
c) Chebyshev’s distance
d) Manhattan distance
Ans: Euclidean distance
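
The distance measures named in question 27 differ only in how coordinate differences are combined (city-block and Manhattan are two names for the same measure). A small sketch with hypothetical helper names and sample points:

```python
import math


def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def chebyshev(p, q):
    """Largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


p, q = (1, 2), (4, 6)
print(euclidean(p, q), manhattan(p, q), chebyshev(p, q))
```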

29) Which of the following is required by K-means clustering?

a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
Ans: defined distance metric, number of clusters, initial guess as to cluster centroids

30) Clustering is a-
A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. None
Ans: Unsupervised learning

31) Which of the following clustering algorithms suffers from the problem of convergence at local optima?
A. K- Means clustering
B. Hierarchical clustering
C. Diverse clustering
D. All of the above
Ans: K- Means clustering, Hierarchical clustering, Diverse clustering

32) Which version of the clustering algorithm is most sensitive to outliers?

A. K-means clustering algorithm
B. K-modes clustering algorithm
C. K-medians clustering algorithm
D. None
Ans: K-means clustering algorithm

34) For clustering, we do not require-

A. Labeled data
B. Unlabeled data
C. Numerical data
D. Categorical data
Ans: Labeled Data

35) Which of the following is an application of clustering?

A. Biological network analysis
B. Market trend prediction
C. Topic modeling
D. All of the above
Ans: Biological network analysis, Market trend prediction, Topic modeling

36) The final output of Hierarchical clustering is-

37. Which type of test is the Wilcoxon rank sum test?

a. Parametric
b. non parametric
c. Distributed
d. Normal

38. Input data for Wilcoxon test is normally distributed,

True or False?

39. What is the null hypothesis for a Wilcoxon test?

a. Two group means are equal.
b. Two or more group means are equal.
c. Two mean groups are not equal.
d. None of these

40. Which of the following decision rules is used in the Wilcoxon Rank Sum Test?

a. If test statistic <= critical value, H0 is rejected
b. If test statistic > critical value, H0 is rejected
c. If test statistic > critical value, H0 is accepted
d. none of these.

41. A Type 1 error is also called a

a. False Positive
b. false negative
c. True Positive
d. True negative

42. A Type 2 error is also called a
a. False Positive
b. False negative
c. True Positive
d. True negative

43. Type 1 error occurs when_____

a. Null hypothesis is rejected when it is true.
b. Null hypothesis is accepted when it is false.
c. Null hypothesis is rejected when it is false.
d. None of Above

44. Type 2 error occurs when_____

a. Null hypothesis is rejected when it is true.
b. Null hypothesis is accepted when it is false.
c. Null hypothesis is rejected when it is false.
d. All of above

44. Analysis of Variance is a statistical method of comparing ____ of several populations.

a. Means
b. variance
c. standard Deviation
d. None of above.

45. ANOVA is used when ____

a. there are more than two populations
b. there are two populations
c. there are three populations
d. there are any number of populations

46. What is Null Hypothesis in ANOVA?

a. all group means are equal
b. Three group means are equal
c. at least one pair of group means is unequal.
d. all group means are unequal.

47. What does ANOVA calculate?

a. Z-score
b. F ratio
c. T-score
d. Chi Square

Q.25 What are the two types of variance which can occur in your data?
a. Independent and Dependent
b. Between and within groups
c. Personal and interpersonal
d. ANOVA and ANCOVA

Q.26 If the between-group mean sum of squares increases, the value of the F statistic _____
a. Increases
b. Decreases
c. Neutral
d. None of these

Q.27 What must you include when applying ANOVA test?

a. Means
b. Critical Value
c. degree of freedom
d. F statistics
e. All of above

Q.28 How many dependent variables are there in a two-way ANOVA?

a. 1
b. 3
c. 2
d. any

Q.29 Which of the following decision rules is used in ANOVA?

a. If critical value > F ratio, H0 is rejected
b. If critical value < F ratio, H0 is rejected
c. If critical value > F ratio, H0 is accepted
d. None of these
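
The ANOVA questions above (between-group and within-group variance, the F ratio, degrees of freedom) can be illustrated with a short one-way ANOVA calculation. This is a sketch in plain Python rather than a statistics package; the function name `f_ratio` and the three sample groups are assumptions made for the example.

```python
from statistics import mean


def f_ratio(groups):
    """One-way ANOVA F ratio with its (between, within) degrees of freedom."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group variability: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group variability: spread of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, (k - 1, n - k)


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
f_value, dof = f_ratio(groups)
print(f_value, dof)
```

The computed F ratio is then compared with the tabulated critical value for these degrees of freedom at the chosen significance level. A larger between-group mean square, with the within-group mean square unchanged, gives a larger F.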

---------------------------------------------------------------------------------------------------------------------
SET 6 MCQs
---------------------------------------------------------------------------------------------------------------------

1.A collection of one or more items is called as _____

(A)Itemset
(B)Support
(C)Confidence
(D)Support Count
Ans:A

2.Frequency of occurrence of an itemset is called as _____

(A)Support
(B)Confidence
(C)Support Count
(D)Rules
Ans:C

3.An itemset whose support is greater than or equal to a minimum support threshold is ______

(A)Itemset
(B)Frequent Itemset
(C)Infrequent items
(D)Threshold values
Ans:B
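
The terms in questions 1 to 3 (itemset, support count, support, frequent itemset) can be made concrete with a brute-force sketch. It simply counts occurrences and is not the Apriori or FP-growth algorithm; the transaction data and the function names are invented for illustration.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, size):
    """All itemsets of the given size whose support meets the threshold."""
    items = sorted(set().union(*transactions))
    return [
        c for c in combinations(items, size)
        if support(c, transactions) >= min_support
    ]


print(support_count({"milk", "diapers"}, transactions))
print(frequent_itemsets(transactions, min_support=0.6, size=2))
```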

4.What does FP growth algorithm do?

(A)It mines all frequent patterns through pruning rules with lesser support
(B)It mines all frequent patterns through pruning rules with higher support
(C)It mines all frequent patterns by constructing a FP tree
(D)It mines all frequent patterns by constructing itemsets
Ans:C

5.What techniques can be used to improve the efficiency of the Apriori algorithm?

(A)Hash-based techniques
(B)Transaction Increases
(C)Sampling
(D)Cleaning
Ans:A

6. Linear Regression is a supervised machine learning algorithm.

A) TRUE
B) FALSE
Ans:A

7.Is it possible to design a Linear regression algorithm using a neural network?

A) TRUE
B) FALSE
Ans:A

8.Which of the following methods do we use to find the best fit line for data in Linear
Regression?
A) Least Square Error
B) Maximum Likelihood
C) Logarithmic Loss
D) Both A and B
Ans:A

a) 0.3 and 0.4
b) 0.25 and 0.4
c) 0.25 and 0.15
d) 0.6 and 0.4
Ans: b

10.Which of the following implies no relationship with respect to correlation?

a) Cor(X, Y) = 1
b) Cor(X, Y) = 0
c) Cor(X, Y) = 2
d) All of the mentioned
Ans:b

11. If a linear regression model fits perfectly, i.e., train error is zero, then _____________________
a) Test error is also always zero
b) Test error is non zero
c) Couldn’t comment on Test error
d) Test error is equal to Train error
Ans:C

12.Which of the following metrics can be used for evaluating regression models?
i) R Squared
ii) Adjusted R Squared
iii) F Statistics
iv) RMSE / MSE / MAE
a) ii and iv
b) i and ii
c) ii, iii and iv
d) i, ii, iii and iv
Ans:d

13.How many coefficients do you need to estimate in a simple linear regression model (One
independent variable)?
a) 1
b) 2
c) 3
d) 4
Ans:b

14.In a simple linear regression model (One independent variable), If we change the input
variable by 1 unit. How much output variable will change?
a) by 1
b) no change
c) by intercept
d) by its slope
Ans:d

15.Function used for linear regression in R is __________

a) lm(formula, data)
b) lr(formula, data)
c) lrm(formula, data)
d) regression.linear(formula, data)
Ans:a

16.In syntax of linear model lm(formula,data,..), data refers to ______

a) Matrix
b) Vector
c) Array
d) List
Ans:d

17.In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to
__________
a) (X-intercept, Slope)
b) (Slope, X-Intercept)
c) (Y-Intercept, Slope)
d) (slope, Y-Intercept)
Ans:c
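
Questions 13, 14, and 17 all concern the two coefficients of simple linear regression. A minimal least-squares sketch in plain Python follows; the function name `fit_line` and the data points are illustrative assumptions, and the result is returned in the (Y-intercept, slope) order used in question 17.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b1 + b2*x; returns (intercept b1, slope b2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # hypothetical data lying exactly on y = 1 + 2x
intercept, slope = fit_line(xs, ys)


def predict(x):
    return intercept + slope * x


# Raising the input by one unit changes the prediction by exactly the slope.
print(intercept, slope, predict(5) - predict(4))
```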

18. ________ is an incredibly powerful tool for analyzing data.

a) Linear regression
b) Logistic regression
c) Gradient Descent
d) Greedy algorithms
Ans:a

19.The square of the correlation coefficient, r², will always be non-negative and is called the ________
a) Regression
b) Coefficient of determination
c) KNN
d) Algorithm
Ans:b
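
As a worked illustration of question 19, the sketch below computes Pearson's r and its square for a small made-up data set; the function name `pearson_r` and the data are assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
r_squared = r ** 2                      # coefficient of determination
r_neg = pearson_r(xs, [-y for y in ys])  # flipping the sign of y flips r, not r squared
print(round(r, 4), round(r_squared, 4), round(r_neg ** 2, 4))
```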

20.Predicting y for a value of x that’s outside the range of values we actually saw for x in the
original data is called ___________
a) Regression
b) Extrapolation
c) Interpolation
d) Polation
Ans:b

21.What is predicting y for a value of x that is within the interval of points that we saw in the
original data called?
a) Regression
b) Extrapolation
c) Interpolation
d) Polation
Ans:c
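
The distinction drawn in questions 20 and 21 depends only on whether the new x value lies inside the range of x values seen in the original data (prediction inside that range is usually called interpolation). A tiny sketch with a hypothetical helper name:

```python
def prediction_kind(x_new, xs):
    """Classify a prediction by comparing x_new with the observed range of x."""
    lowest, highest = min(xs), max(xs)
    return "interpolation" if lowest <= x_new <= highest else "extrapolation"


observed_x = [2, 4, 6, 8, 10]  # hypothetical x values from the original data
print(prediction_kind(5, observed_x), prediction_kind(15, observed_x))
```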

22. ________ is a simple approach to supervised learning. It assumes that the dependence of Y
on X1, X2, . . . Xp is linear.
a) Linear regression
b) Logistic regression
c) Gradient Descent
d) Greedy algorithms
Ans:a

23.Although it may seem overly simplistic, _______ is extremely useful both conceptually and
practically.
a) Linear regression
b) Logistic regression
c) Gradient Descent
d) Greedy algorithms
Ans:a

24. __________ refers to a group of techniques for fitting and studying the straight-line relationship between two variables.
a) Linear regression
b) Logistic regression
c) Gradient Descent
d) Greedy algorithms
Ans:a

25. What do you mean by support(A)?

---------------------------------------------------------------------------------------------------------------------
SET 7 MCQs
---------------------------------------------------------------------------------------------------------------------

1. Data Analysis is a process of?

A. inspecting data
B. cleaning data
C. transforming data
D. All of the above
Ans : D

2. Which of the following is not a major data analysis approach?

A. Data Mining B. Predictive Intelligence C. Business Intelligence D. Text Analytics
Ans : B

3. How many main statistical methodologies are used in data analysis?

A. 2 B. 3 C. 4 D. 5
Ans : A

4. In descriptive statistics, data from the entire population or a sample is summarized with ?
A. integer descriptors B. floating descriptors C. numerical descriptors D. decimal descriptors
Ans : C

5. Data analysis was defined by which statistician?

A. William S. B. Hans Peter Luhn C. Gregory Piatetsky-Shapiro D. John Tukey
Ans : D

6. Which of the following is true about hypothesis testing?

A. answering yes/no questions about the data B. estimating numerical characteristics of the data
C. describing associations within the data D. modeling relationships within the data
Ans : A

7. The goal of business intelligence is to allow easy interpretation of large volumes of data to
identify new opportunities.
A. TRUE B. FALSE C. Can be true or false D. Can not say
Ans : A

8. The branch of statistics which deals with development of particular statistical methods is classified as
A. industry statistics B. economic statistics C. applied statistics D. mathematical statistics
Ans : D

9. Which of the following is true about regression analysis?

A. answering yes/no questions about the data
B. estimating numerical characteristics of the data
C. modeling relationships within the data
D. describing associations within the data
Ans : C

10. Text Analytics is also referred to as Text Mining.

A. TRUE B. FALSE C. Can be true or false D. Can not say
Ans : A

1. What is a hypothesis?
a. A statement that the researcher wants to test through the data collected in a study.
b. A research question the results will answer.
c. A theory that underpins the study.
d. A statistical method for calculating the extent to which the results could have happened by chance.
Answer: a

2. Qualitative data analysis is still a relatively new and rapidly developing branch of research methodology.
a. True
b. False
Answer: a

3. The process of marking segments of data with symbols, descriptive words, or category names is known as _______.
a. Concurring
b. Coding
c. Colouring
d. Segmenting
Answer: b

4. What is the cyclical process of collecting and analysing data during a single research study called?
a. Interim analysis
b. Inter analysis
c. Inter-item analysis
d. Constant analysis
Answer: a

5. The process of quantifying data is referred to as _________.

a. Typology
b. Diagramming
c. Enumeration
d. Coding
Answer: c

6. An advantage of using computer programs for qualitative data is that they _______.
a. Can reduce time required to analyse data (i.e., after the data are transcribed)
b. Help in storing and organising data
c. Make many procedures available that are rarely done by hand due to time constraints
d. All of the above
Answer: d

7. Boolean operators are words that are used to create logical combinations.
a. True
b. False
Answer: a

8. __________ are the basic building blocks of qualitative data.

a. Categories
b. Units
c. Individuals
d. None of the above
Answer: a

9. This is the process of transforming qualitative research data from written interviews or field notes into typed text.
a. Segmenting
b. Coding
c. Transcription
d. Mnemoning
Answer: c

10. A challenge of qualitative data analysis is that it often includes data that are unwieldy and complex; it is a major challenge to make sense of the large pool of data.
a. True
b. False
Answer: a

11. Hypothesis testing and estimation are both types of descriptive statistics.
a. True
b. False
Answer: b

12. A set of data organised in a participants (rows)-by-variables (columns) format is known as a “data set.”
a. True
b. False
Answer: a

13. A graph that uses vertical bars to represent data is called a ___
a. Line graph
b. Bar graph
c. Scatterplot
d. Vertical graph
Answer: b

14. ___________ are used when you want to visually examine the relationship between two quantitative variables.
a. Bar graphs
b. Pie graphs
c. Line graphs
d. Scatterplots
Answer: d

15. The denominator (bottom) of the z-score formula is

a. The standard deviation
b. The difference between a score and the mean
c. The range
d. The mean
Answer: a
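
The formula behind question 15 is z = (score - mean) / standard deviation. A short sketch, assuming the population standard deviation is the intended denominator; the function name `z_score` and the scores are made up for illustration.

```python
from statistics import mean, pstdev


def z_score(x, data):
    """Number of standard deviations by which x lies above the mean of data."""
    return (x - mean(data)) / pstdev(data)


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data: mean 5, standard deviation 2
print(z_score(9, scores), z_score(5, scores))
```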

16. Which of these distributions is used for testing a hypothesis?

a. Normal Distribution
b. Chi-Squared Distribution
c. Gamma Distribution
d. Poisson Distribution
Answer: b

17. A statement made about a population for testing purpose is called?
a. Statistic
b. Hypothesis
c. Level of Significance
d. Test-Statistic
Answer: b

18. A hypothesis that is tested for possible rejection under the assumption that it is true is called?
a. Null Hypothesis
b. Statistical Hypothesis
c. Simple Hypothesis
d. Composite Hypothesis
Answer: a

19. If the null hypothesis is false then which of the following is accepted?
a. Null Hypothesis
b. Positive Hypothesis
c. Negative Hypothesis
d. Alternative Hypothesis.
Answer: d

20. The alternative hypothesis is also called?

a. Composite hypothesis
b. Research Hypothesis
c. Simple Hypothesis
d. Null Hypothesis
Answer: b

---------------------------------------------------------------------------------------------------------------------
SET 8 MCQs
---------------------------------------------------------------------------------------------------------------------

Q 1 - When a JobTracker schedules a task, it first looks for

A - A node with empty slot in the same rack as datanode
B - Any node on the same rack as the datanode
C - Any node on the rack adjacent to rack of the datanode
D - Just any node in the cluster

Q 2 - The heartbeat signals are sent from

A - Jobtracker to Tasktracker
B - Tasktracker to Job tracker
C - Jobtracker to namenode
D - Tasktracker to namenode

Q 3 - Job tracker runs on

A - Namenode
B - Datanode
C - Secondary namenode
D - Secondary datanode

Q 4 - Which of the following is not a scheduling option available in YARN

A - Balanced scheduler
B - Fair scheduler
C - Capacity scheduler

OptimusPrime Page 43
D - FiFO schesduler.

Q 5 - What is the default input format?

A - The default input format is xml. Developer can specify other input formats as appropriate if
xml is not the correct input.
B - There is no default input format. The input format always should be specified.
C - The default input format is a sequence file format. The data needs to be preprocessed before
using the default input format.
D - The default input format is TextInputFormat with byte offset as a key and entire line as a
value.
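
The record shape described in option D of Q 5 (byte offset as key, whole line as value) can be imitated in a few lines of plain Python. This is only an illustration of the key/value layout, not Hadoop code; the function name `text_records` and the sample bytes are assumptions.

```python
def text_records(data: bytes):
    """Yield (byte offset of line start, line text without its terminator)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the next line starts right after this one


sample = b"hello world\nfoo bar\nbaz\n"
records = list(text_records(sample))
print(records)
```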

Q 6 - Which one is not one of the big data features?

A - Velocity
B - Veracity
C - volume
D - variety

Q 7 - Which technology is used to store data in Hadoop?

A - HBase
B - Avro
C - Sqoop
D - Zookeeper

Q 8 - Which technology is used to serialize the data in Hadoop?

A - HBase
B - Avro
C - Sqoop
D - Zookeeper

Q 9 - Which technology is used to import and export data in Hadoop?

A - HBase
B - Avro
C - Sqoop
D - Zookeeper

Q 10 - Which of the following technologies is a document store database?

A - HBase
B - Hive
C - Cassandra
D - CouchDB

Q 11 - Which one of the following is not true regarding Hadoop?

A - It is a distributed framework.
B - The main algorithm used in it is Map Reduce
C - It runs with commodity hardware
D - All are true

Q 12 - Which one of the following stores data?
A - Name node
B - Data node
C - Master node
D - None of these

Q 13 - Which one of the following nodes manages other nodes?

A - Name node
B - Data node
C - slave node
D - None of these

Q 14 - What is AVRO?
A - Avro is a java serialization library.
B - Avro is a java compression library.
C - Avro is a java library that creates splittable files.
D - None of these answers are correct.

Q 15 - Can you run Map - Reduce jobs directly on Avro data?

A - Yes, Avro was specifically designed for data processing via Map-Reduce.
B - Yes, but additional extensive coding is required.
C - No, Avro was specifically designed for data storage only.
D - Avro specifies metadata that allows easier data access. This data cannot be used as part of
map-reduce execution, rather input specification only.

Q 16 - What is distributed cache?

A - The distributed cache is a special component on the name node that will cache frequently used data for faster client response. It is used during the reduce step.
B - The distributed cache is a special component on the data node that will cache frequently used data for faster client response. It is used during the map step.
C - The distributed cache is a component that caches java objects.
D - The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.

Q 17 - What is writable?
A - Writable is a java interface that needs to be implemented for streaming data to remote servers.
B - Writable is a java interface that needs to be implemented for HDFS writes.
C - Writable is a java interface that needs to be implemented for MapReduce processing.
D - None of these answers are correct.

Q 18 - What is HBASE?
A - HBase is a separate set of the Java API for a Hadoop cluster.
B - HBase is a part of the Apache Hadoop project that provides an interface for scanning large amounts of data using the Hadoop infrastructure.
D - HBase is a part of the Apache Hadoop project that provides a SQL-like interface for data processing.

Q 19 - How does Hadoop process large volumes of data?

A - Hadoop uses a lot of machines in parallel. This optimizes data processing.
B - Hadoop was specifically designed to process large amounts of data by taking advantage of MPP hardware.
C - Hadoop ships the code to the data instead of sending the data to the code.
D - Hadoop uses sophisticated caching techniques on name node to speed processing of data.

Q 20 - When using HDFS, what occurs when a file is deleted from the command line?
A - It is permanently deleted if trash is enabled.
B - It is placed into a trash directory common to all users for that cluster.
C - It is permanently deleted and the file attributes are recorded in a log file.
D - It is moved into the trash directory of the user who deleted it if trash is enabled.

Q 21 - When archiving Hadoop files, which of the following statements are true?
Choose two answers.
1. Archived files will display with the extension .arc.
2. Many small files will become fewer large files.
3. MapReduce processes the original file names even after files are archived.
4. Archived files must be unarchived for HDFS and MapReduce to access the original, small files.
5. Archive is intended for files that need to be saved but no longer accessed by HDFS.
A - 1 & 3
B - 2 & 3
C - 2 & 4
D - 3 & 4

Q 22 - When writing data to HDFS, what is true if the replication factor is three?
Choose 2 answers.
1. Data is written to DataNodes on three separate racks if rack-aware.
2. The data is stored on each DataNode with a separate file which contains a checksum value.
3. Data is written to blocks on three different DataNodes.
4. The client is returned a success upon the successful writing of the first block and checksum check.
A - 1 & 3
B - 2 & 3
C - 3 & 4
D - 1 & 4

Q 23 - Which of the following are among the duties of the Data Nodes in HDFS?
A - Maintain the file system tree and metadata for all files and directories.
B - None of the options is correct.
C - Control the execution of an individual map task or a reduce task.
D - Store and retrieve blocks when told to by clients or the NameNode.
E - Manage the file system namespace.

Q 24 - Which of the following components retrieves the input splits directly from HDFS to determine the number of map tasks?
A - The NameNode.
B - The TaskTrackers.
C - The JobClient.
D - The JobTracker.
E - None of the options is correct.

Q 25 - The org.apache.hadoop.io.Writable interface declares which two methods?

Choose 2 answers.
1. public void readFields(DataInput).
2. public void read(DataInput).
3. public void writeFields(DataOutput).
4. public void write(DataOutput).
A - 1 & 4
B - 2 & 3
C - 3 & 4
D - 2 & 4

Q 26 - Which one of the following statements is true regarding <key,value> pairs of a MapReduce job?
A - A key class must implement Writable.
B - A key class must implement WritableComparable.
C - A value class must implement WritableComparable.
D - A value class must extend WritableComparable.

Q 27 - Which one of the following statements is false regarding the Distributed Cache?
A - The Hadoop framework will ensure that any files in the Distributed Cache are distributed to all map and reduce tasks.
B - The files in the cache can be text files, or they can be archive files like zip and JAR files.
C - Disk I/O is avoided because data in the cache is stored in memory.
D - The Hadoop framework will copy the files in the Distributed Cache on to the slave node before any tasks for the job are executed on that node.

Q 28 - Which one of the following is not a main component of HBase?

A - Region Server.
B - Nagios.
C - ZooKeeper.
D - Master Server.

Q 29 - Which of the following is false about RawComparator?

A - Compare the keys by byte.
B - Performance can be improved in sort and shuffle phase by using RawComparator.
C - Intermediary keys are deserialized to perform a comparison.

Q 30 - Which daemon is responsible for replication of data in Hadoop?

A - HDFS.
B - Task Tracker.
C - Job Tracker.
D - Name Node.
E - Data Node.

Q 31 - Keys from the output of shuffle and sort implement which of the following
interfaces?
A - Writable.
B - WritableComparable.
C - Configurable.
D - ComparableWritable.
E - Comparable.

Q 32 - In order to apply a combiner, what is one property that has to be satisfied by
the values emitted from the mapper?
A - Combiner can be applied always to any data
B - Output of the mapper and output of the combiner have to be the same key-value pair type, and they can
be heterogeneous
C - Output of the mapper and output of the combiner have to be the same key-value pair type. A combiner can
be applied only if the values satisfy the associative and commutative properties.
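
The associative and commutative requirement in Q 32 can be illustrated with a small simulation. This is only a sketch in plain Python, not Hadoop code: `run_job`, the two hard-coded splits, and the `mean` helper are hypothetical names introduced for illustration. It shows that a sum survives local pre-aggregation while a mean does not.

```python
from collections import defaultdict


def run_job(splits, combine=None, reduce_fn=sum):
    """Simulate map -> optional combiner -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for split in splits:                      # one simulated mapper per split
        local = defaultdict(list)
        for key, value in split:              # identity map function
            local[key].append(value)
        for key, values in local.items():
            if combine is not None:
                values = [combine(values)]    # local pre-aggregation on the mapper side
            shuffled[key].extend(values)
    return {key: reduce_fn(values) for key, values in shuffled.items()}


def mean(values):
    return sum(values) / len(values)


# Hypothetical input: two splits holding values for the same key.
splits = [[("a", 1), ("a", 2), ("a", 3)], [("a", 10)]]

sum_plain = run_job(splits, reduce_fn=sum)
sum_combined = run_job(splits, combine=sum, reduce_fn=sum)
mean_plain = run_job(splits, reduce_fn=mean)
mean_combined = run_job(splits, combine=mean, reduce_fn=mean)

print(sum_plain, sum_combined)    # the two sums agree
print(mean_plain, mean_combined)  # the two means differ
```

Summation is associative and commutative, so the combiner does not change the result. A mean of partial means weights each split equally regardless of its size, which is why the combined run drifts away from the true mean.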

Answer Key :

1 A
2 B
3 A
4 A
5 D
6 B
7 A
8 B
9 C
10 D
11 D
12 B
13 A
14 A
15 A
16 B
17 C
18 B
19 C
20 C
21 B
22 C
23 D
24 D
25 A
26 B
27 C
28 B
29 C
30 D
31 B
32 C

---------------------------------------------------------------------------------------------------------------------
----------------

Q 1 - HDFS can be accessed over HTTP using

A - viewfs URI scheme
B - webhdfs URI scheme
C - wasb URI scheme
D - HDFS ftp

Q 2 - Which of the following is true about HDFS?

A - HDFS filesystem can be mounted on a local client’s Filesystem using NFS.
B - HDFS filesystem can never be mounted on a local client’s Filesystem.
C - You can edit an existing record in an HDFS file which is already mounted using NFS.
D - You cannot append to an HDFS file which is mounted using NFS.

Q 3 - The client reading the data from HDFS filesystem in Hadoop

A - gets the data from the namenode
B - gets the block location from the datanode
C - gets only the block locations from the namenode
D - gets both the data and block location from the namenode

Q 4 - Which scenario demands highest bandwidth for data transfer between nodes in
Hadoop?
A - Different nodes on the same rack
B - Nodes on different racks in the same data center.
C - Nodes in different data centers
D - Data on the same node.

Q 5 - The current block location of HDFS where data is being written to,
A - is visible to the client requesting it.
B - Block locations are never visible to client requests.
C - May or may not be visible to the reader.
D - becomes visible only after the buffered data is committed.

Q 6 - Which of these is not a scheduler option available with YARN?

A - Optimal Scheduler
B - FIFO scheduler
C - Capacity scheduler
D - Fair scheduler

Q 7 - Which of the following is not a Hadoop operation mode?

A - Pseudo distributed mode
B - Globally distributed mode
C - Stand alone mode
D - Fully-Distributed mode

Q 8 - The difference between standalone and pseudo-distributed mode is

A - Stand alone cannot use map reduce
B - Stand alone has a single java process running in it.
C - Pseudo distributed mode does not use HDFS
D - Pseudo distributed mode needs two or more physical machines.

Q 9 - The Hadoop framework is written in

A - C++
B - Python
C - Java
D - Go

Q 10 - The hdfs command to create the copy of a file from a local system is
A - CopyFromLocal
B - copyfromlocal
C - CopyLocal
D – copyFromLocal

Q 11 - The hdfs command put is used to

A - Copy files from local file system to HDFS.
B - Copy files or directories from local file system to HDFS.
C - Copy files from HDFS to local filesystem.
D - Copy files or directories from HDFS to local filesystem.

Q 12 - Underreplication in HDFS means-

A - No replication is happening in the data nodes.
B - Replication process is very slow in the data nodes.
C - The frequency of replication in data nodes is very low.
D - The number of replicated copies is less than that specified by the replication factor.

Q 13 - When the namenode finds that some blocks are over replicated, it
A - Stops the replication job in the entire hdfs file system.
B - It slows down the replication process for those blocks
C - It deletes the extra blocks.
D - It leaves the extra blocks as it is.

Q 14 - Which of the following properties is configured in core-site.xml?

A - Replication factor
B - Directory names to store hdfs files.
C - Host and port where MapReduce task runs.
D - Java Environment variables.

Q 15 - Which of the following properties is configured in hdfs-site.xml?

A - Replication factor
B - Directory names to store hdfs files.
C - Host and port where MapReduce task runs.
D - Java Environment variables.

Q 16 - Which of the following properties is configured in mapred-site.xml?

A - Replication factor
B - Directory names to store hdfs files.
C - Host and port where MapReduce task runs.
D - Java Environment variables.

Q 17 - Which of the following properties is configured in hadoop-env.sh?

A - Replication factor
B - Directory names to store hdfs files
C - Host and port where MapReduce task runs
D - Java Environment variables.

Q 18 - The command to check if Hadoop is up and running is −

A - jsp
B - jps
C - hadoop fs -test
D - None

Q 19 - The information mapping data blocks with their corresponding files is stored in
A - Data node
B - Job Tracker
C - Task Tracker
D – Namenode

Q 20 - The file in Namenode which stores the information mapping the data block
location with file name is −
A - dfsimage
B - nameimage
C - fsimage
D – image

Q 21 - The namenode knows that the datanode is active using a mechanism known as
A - heartbeats
B - datapulse
C - h-signal
D - Active-pulse

Q 22 - The nature of hardware for the namenode should be

A - Superior than commodity grade
B - Commodity grade
C - Does not matter
D - Just have more Ram than each of the data nodes

Q 23 - In Hadoop, Snappy and LZO are examples of

A - Mechanisms of file transport between data nodes
B - Mechanisms of data compression
C - Mechanisms of data Replication
D - Mechanisms of Data synchronization

Q 24 - Which of the following Apache systems deals with ingesting streaming data into
Hadoop?
A - Oozie
B - Kafka
C - Flume
D – Hive

Q 25 - The input split used in MapReduce indicates

A - The average size of the data blocks used as input for the program
B - The location details of where the first whole record in a block begins and the last whole
record in the block ends.
C - Splitting the input data to a MapReduce program into a size already configured in the
mapred-site.xml
D - None of these

Q 26 - The output of a mapper task is

A - The Key-value pair of all the records of the dataset.
B - The Key-value pair of all the records from the input split processed by the mapper
C - Only the sorted Keys from the input split
D - The number of rows processed by the mapper task.

Q 27 - The role of a Journal node is to

A - Report the location of the blocks in a data node
B - Report the edit log information of the blocks in the data node.
C - Report the Schedules when the jobs are going to run
D - Report the activity of various components handled by resource manager

Q 28 - The Zookeeper
A - Detects the failure of the namenode and elects a new namenode.
B - Detects the failure of datanodes and elects a new datanode.
C - Prevents the hardware from overheating by shutting them down.
D - Maintains a list of the IP addresses of all the components of the Hadoop cluster.

Q 29 - If the IP address or hostname of a datanode changes

A - The namenode updates the mapping between file name and block name
B - The namenode need not update mapping between file name and block name
C - The data in that data node is lost forever
D - The namenode has to be restarted

Q 30 - When a client contacts the namenode for accessing a file, the namenode
responds with
A - Size of the file requested.
B - Block ID of the file requested.
C - Block ID and hostname of any one of the data nodes containing that block.
D - Block ID and hostname of all the data nodes containing that block.

Q 31 - HDFS stands for

A - Highly distributed file system.
B - Hadoop directed file system
C - Highly distributed file shell
D - Hadoop distributed file system.

Q 32 - The Hadoop tool used for uniformly spreading the data across the data nodes is
named −
A - Scheduler
B - Balancer
C - Spreader
D – Reporter

Q 33 - In the secondary namenode the amount of memory needed is

A - Similar to that of primary node
B - Should be at least half of the primary node
C - Must be double of that of primary node
D - Depends only on the number of data nodes it is going to handle

Answer Key :
1 B
2 A
3 C
4 C
5 D
6 A
7 B
8 B
9 C
10 D
11 B
12 D
13 C
14 B
15 A
16 C
17 D
18 B
19 D
20 C
21 A
22 A
23 B
24 C
25 B
26 B
27 B
28 A
29 B
30 D
31 D
32 B
33 A

---------------------------------------------------------------------------------------------------------------------
----------------

Q 1 - The purpose of checkpoint node in a Hadoop cluster is to

A - Check if the namenode is active
B - Check if the fsimage file is in sync between namenode and secondary namenode
C - Merges the fsimage and edit log and uploads it back to active namenode.
D - Check which data nodes are unreachable.

Q 2 - When a backup node is used in a cluster there is no need of

A - Check point node
B - Secondary name node
C - Secondary data node
D - Rack awareness

Q 3 - Rack awareness in name node means

A - It is aware how many racks are available in the cluster
B - It is aware of the mapping between the node and the rack
C - It is aware of the number of nodes in each of the rack
D - It is aware which data nodes are unavailable in the cluster.

Q 4 - When a machine is declared as a datanode, the disk space in it

A - Can be used only for HDFS storage
B - Can be used for both HDFS and non-HDFS storage
C - Cannot be accessed by non-hadoop commands
D - cannot store text files.

Q 5 - When a file in HDFS is deleted by a user

A - it is lost forever
B - It goes to trash if configured.
C - It becomes hidden from the user but stays in the file system
D - Files in HDFS cannot be deleted

Q 6 - The source of HDFS architecture in Hadoop originated as

A - Google distributed filesystem
B - Yahoo distributed filesystem
C - Facebook distributed filesystem
D - Azure distributed filesystem

Q 7 - The inter process communication between different nodes in Hadoop uses

A - REST API
B - RPC
C - RMI
D - IP Exchange

Q 8 - The type of data Hadoop can deal with is

A - Structured
B - Semi-structured
C - Unstructured
D - All of the above

Q 9 - YARN stands for

A - Yahoo’s another resource name
B - Yet another resource negotiator
C - Yahoo’s archived Resource names
D - Yet another resource need.

Q 10 - The fully distributed mode of installation (without virtualization) needs a minimum of

A - 2 physical machines
B - 3 Physical machines
C - 4 Physical machines
D - 1 Physical machine

Q 11 - Running start-dfs.sh results in
A - Starting namenode and datanode
B - Starting namenode only
C - Starting datanode only
D - Starting namenode and resource manager

Q 12 - Which of the following is not a goal of HDFS?

A - Fault detection and recovery
B - Handle huge dataset
C - Prevent deletion of data
D - Provide high network bandwidth for data movement

Q 13 - The command “hadoop fs -test -z URI” gives the result 0

A - if the path is a directory
B - if the path is a file
C - if the path is not empty
D - if the file is zero length

Q 14 - In HDFS the files cannot be

A - read
B - deleted
C - executed
D – Archived

Q 15 - hadoop fs -expunge
A - Gives the list of datanodes
B - Used to delete a file
C - Used to exchange a file between two datanodes.
D - Empties the trash.

Q 16 - All the files in a directory in HDFS can be merged together using

A - getmerge
B - putmerge
C - remerge
D – mergeall

Q 17 - The replication factor of a file in HDFS can be changed using

A - changerep
B - rerep
C - setrep
D – xrep

Q 18 - The command used to copy a directory from one node to another in HDFS is
A - rcp
B - dcp
C - drcp
D - distcp

Q 19 - The archive file created in Hadoop always has the extension of

A - .hrc
B - .har
C - .hrh
D - .hrar

Q 20 - To unarchive an already archived file in Hadoop use the command

A - unrar
B - unhar
C - cp
D – cphar

Q 21 - The data from a remote hadoop cluster can

A - not be read by another hadoop cluster
B - be read using http
C - be read using hhtp
D - be read using hftp

Q 22 - The purpose of starting namenode in the recovery mode is to

A - Recover a failed namenode
B - Recover a failed datanode
C - Recover data from one of the metadata storage locations
D - Recover data when there is only one metadata storage location

Q 23 - When you increase the number of files stored in HDFS, the memory required by
namenode
A - Increases
B - Decreases
C - Remains unchanged
D - May increase or decrease

Q 24 - If we increase the size of files stored in HDFS without increasing the number of
files, then the memory required by namenode
A - Decreases
B - Increases
C - Remains unchanged
D - May or may not increase

Q 25 - The current limiting factor to the size of a hadoop cluster is

A - Excess heat generated in data center
B - Upper limit of the network bandwidth
C - Upper limit of the RAM in namenode
D - 4000 data nodes

Q 26 - The decommission feature in hadoop is used for
A - Decommissioning the namenode
B - Decommissioning the data nodes
C - Decommissioning the secondary namenode.
D - Decommissioning the entire Hadoop cluster.

Q 27 - You can reserve the amount of disk usage in a data node by configuring
dfs.datanode.du.reserved in which of the following files?
A - hdfs-site.xml
B - hdfs-default.xml
C - core-site.xml
D - mapred-site.xml

Q 28 - The namenode loses its only copy of fsimage file. We can recover this from
A - Datanodes
B - Secondary namenode
C - Checkpoint node
D – Never

Q 29 - In a HDFS system with block size 64MB we store a file which is less than 64MB.
Which of the following is true?
A - The file will consume 64MB
B - The file will consume more than 64MB
C - The file will consume less than 64MB.
D - Can not be predicted.

Q 30 - A running job in hadoop can

A - Be killed with a command
B - Can never be killed with a command
C - Can be killed only by shutting down the name node
D - Be paused and run again

Q 31 - The number of tasks a task tracker can accept depends on

A - Maximum memory available in the node
B - Not limited
C - Number of slots configured in it
D - As decided by the jobTracker

Q 32 - When a jobTracker schedules a task, it first looks for

A - A node with empty slot in the same rack as datanode
B - Any node on the same rack as the datanode
C - Any node on the rack adjacent to rack of the datanode
D - Just any node in the cluster

Q 33 - Heartbeat signals are sent from

A - Jobtracker to Tasktracker
B - Tasktracker to Jobtracker
C - Jobtracker to namenode
D - Tasktracker to namenode

Answer Key :

1 C
2 A
3 B
4 B
5 B
6 A
7 B
8 D
9 B
10 A
11 A
12 C
13 D
14 C
15 D
16 A
17 C
18 D
19 B
20 C
21 D
22 D
23 A
24 A
25 C
26 B
27 A
28 C
29 C
30 A
31 C
32 A
33 B

---------------------------------------------------------------------------------------------------------------------
------------------------------------------------

Q 1 - The concept of using multiple machines to process data stored in a distributed
system is not new. High-performance computing (HPC) uses many computing machines to process
large volumes of data stored in a storage area network (SAN). As compared to HPC,
Hadoop
A - Can process a larger volume of data.
B - Can run on a larger number of machines than HPC cluster.
C - Can process data faster under the same network bandwidth as compared to HPC.
D - Cannot run compute intensive jobs.

Q 2 - Hadoop differs from volunteer computing in

A - Volunteers donating CPU time and not network bandwidth.
B - Volunteers donating network bandwidth and not CPU time.
C - Hadoop cannot search for large prime numbers.
D - Only Hadoop can use mapreduce.

Q 3 - As compared to RDBMS, Hadoop

A - Has higher data Integrity.
B - Does ACID transactions
C - Is suitable for read and write many times
D - Works better on unstructured and semi-structured data.

Q 4 - What is the main problem faced while reading and writing data in parallel from
multiple disks?
A - Processing high volume of data faster.
B - Combining data from multiple disks.
C - The software required to do this task is extremely costly.
D - The hardware required to do this task is extremely costly.

Q 5 - Which of the following is true for disk drives over a period of time?
A - Data Seek time is improving faster than data transfer rate.
B - Data Seek time is improving more slowly than data transfer rate.
C - Data Seek time and data transfer rate are both increasing proportionately.
D - Only the storage capacity is increasing without increase in data transfer rate.

Q 6 - Data locality feature in Hadoop means

A - store the same data across multiple nodes.
B - relocate the data from one node to another.
C - co-locate the data with the computing nodes.
D - Distribute the data across multiple nodes.

Q 7 - Which of these provides a Stream processing system used in Hadoop ecosystem?

A - Solr
B - Tez
C - Spark
D – Hive

Q 8 - HDFS files are designed for

A - Multiple writers and modifications at arbitrary offsets.
B - Only append at the end of file
C - Writing into a file only once.
D - Low latency data access.

Q 9 - A file in HDFS that is smaller than a single block size

A - Cannot be stored in HDFS.
B - Occupies the full block's size.
C - Occupies only the size it needs and not the full block.
D - Can span over multiple blocks.
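
The block arithmetic behind Q 9 can be sketched in a few lines. This is a rough illustration only: `hdfs_footprint` is a hypothetical helper, and the 64 MB block size and replication factor of 3 are assumed defaults taken from other questions in this set, not values read from a real cluster.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (number_of_blocks, raw_disk_usage_mb) for a single file.

    A block is only a logical upper bound, so the last (or only) block
    takes up just the bytes actually written to it.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_usage_mb = file_size_mb * replication   # every byte is stored `replication` times
    return blocks, raw_usage_mb


print(hdfs_footprint(10))    # a small file: one block, far less than 64 MB per replica
print(hdfs_footprint(200))   # a larger file spread over several blocks
```

The namenode tracks one entry per block, so many small files increase namenode memory pressure even though each one uses little disk space.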

Q 10 - HDFS block size is larger as compared to the size of the disk blocks so that
A - Only HDFS files can be stored in the disk used.
B - The seek time is maximum
C - Transfer of large files made of multiple disk blocks is not possible.
D - A single file larger than the disk size can be stored across many disks in the cluster.

Q 11 - In a Hadoop cluster, what is true for an HDFS block that is no longer available
due to disk corruption or machine failure?
A - It is lost forever
B - It can be replicated from its alternative locations to other live machines.
C - The namenode allows new client request to keep trying to read it.
D - The Mapreduce job process runs ignoring the block and the data stored in it.

Q 12 - Which utility is used for checking the health of a HDFS file system?
A - fchk
B - fsck
C – fsch
D – fcks

Q 13 - Which command lists the blocks that make up each file in the filesystem.
A - hdfs fsck / -files -blocks
B - hdfs fsck / -blocks -files
C - hdfs fchk / -blocks -files
D - hdfs fchk / -files -blocks

Q 14 - The datanode and namenode are respectively

A - Master and worker nodes
B - Worker and Master nodes
C - Both are worker nodes
D – None

Q 15 - In the local disk of the namenode the files which are stored persistently are −
A - namespace image and edit log
B - block locations and namespace image
C - edit log and block locations
D - Namespace image, edit log and block locations.

Q 16 - When a client communicates with the HDFS file system, it needs to
communicate with
A - only the namenode
B - only the data node
C - both the namenode and datanode
D - None of these

Q 17 - What mechanism does Hadoop use to make the namenode resilient to failure?

A - Take backup of filesystem metadata to a local disk and a remote NFS mount.
B - Store the filesystem metadata in cloud.
C - Use a machine with at least 12 CPUs
D - Using expensive and reliable hardware.

Q 18 - The main role of the secondary namenode is to

A - Copy the filesystem metadata from primary namenode.
B - Copy the filesystem metadata from NFS stored by primary namenode
C - Monitor if the primary namenode is up and running.
D - Periodically merge the namespace image with the edit log.

Q 19 - For the frequently accessed HDFS files the blocks are cached in
A - the memory of the datanode
B - in the memory of the namenode
C - Both A&B
D - In the memory of the client application which requested the access to these files.

Q 20 - User applications can instruct the namenode to cache the files by

A - adding cache file names to cache pool
B - adding cache config to cache pool
C - adding cache directive to cache pool
D - passing the file names as parameters to the cache pool

Q 21 - In Hadoop 2.x release HDFS federation means

A - Allowing namenodes to communicate with each other.
B - Allow a cluster to scale by adding more datanodes under one namenode.
C - Allow a cluster to scale by adding more namenodes.
D - Adding more physical memory to both namenode and datanode.

Q 22 - Under HDFS federation

A - Each namenode manages metadata of the entire filesystem.
B - Each namenode manages metadata of a portion of the filesystem.
C - Failure of one namenode causes loss of some metadata availability from the entire
filesystem.
D - Each datanode registers with each namenode.

Q 23 - The main goal of HDFS High availability is

A - Faster creation of the replicas of primary namenode.
B - To reduce the cycle time required to bring back a new primary namenode after existing
primary fails.
C - Prevent data loss due to failure of primary namenode.
D - Prevent the primary namenode from becoming single point of failure.

Q 24 - As part of the HDFS high availability a pair of primary namenodes are
configured. What is true for them?
A - When a client request comes, one of them chosen at random serves the request.
B - One of them is active while the other one remains powered off.
C - Datanodes send block reports to only one of the namenodes.
D - The standby node takes periodic checkpoints of active namenode’s namespace.

Q 25 - Zookeeper ensures that

A - All the namenodes are actively serving the client requests
B - Only one namenode is actively serving the client requests
C - A failover is triggered when any of the datanode fails.
D - A failover can not be started by hadoop administrator.

Q 26 - Under Hadoop High Availability, Fencing means

A - Preventing a previously active namenode from starting to run again.
B - Preventing the start of a failover in the event of network failure with the active namenode.
C - Preventing the power down to the previously active namenode.
D - Preventing a previously active namenode from writing to the edit log.

Q 27 - Which of the following is not a fencing mechanism for a previously active
namenode?
A - Disabling its network port via a remote management command.
B - Revoking its access to shared storage directory.
C - Formatting its disk drive.
D – STONITH

Q 28 - The property used to set the default filesystem for Hadoop in core-site.xml is
A - filesystem.default
B - fs.default
C - fs.defaultFS
D - hdfs.default

Q 29 - The default replication factor for HDFS file system in hadoop is

A - 1
B - 2
C - 3
D - 4

Q 30 - When running on a pseudo distributed mode the replication factor is set to

A - 2
B - 1
C - 0
D - 3

Q 31 - For an HDFS directory the replication factor (RF) is

A - same as the RF of the files in that directory
B - Zero
C - 3
D - Does not apply.

Q 32 - The following is not permitted on HDFS files

A - Deleting
B - Renaming
C - Moving
D - Executing.

Answer Key :

1 C
2 A
3 D
4 B
5 B
6 C
7 C
8 B
9 C
10 D
11 B
12 B
13 A
14 B
15 A
16 C
17 A
18 D
19 A
20 C
21 C
22 B
23 B
24 D
25 B
26 D
27 C
28 C
29 C
30 B
31 D
32 D

S. No. Objective Questions (MCQ / True or False / Fill up with Choices)

1. Which of the following is not an example of Social Media?
a. Twitter
b. Google
c. Insta
d. Youtube

2. By 2025, the volume of digital data will increase to
a. TB
b. YB
c. ZB
d. EB

3. For drawing insights for business, what is needed?
a. Collecting the data
b. Storing the data
c. Analysing the data
d. All the above

4. Does Facebook use "Big Data" to perform the concept of Flashback? Is this True or False?
a. TRUE
b. FALSE

5. The process of describing data that is huge and complex to store and process is known as
a. Analytics
b. Data mining
c. Big Data
d. Data Warehouse

6. Data generated from online transactions is one example of the volume of big data. Is this True or False?
a. TRUE
b. FALSE

7. Velocity is the speed at which the data is processed.
a. TRUE
b. FALSE

8. ________ have a structure but cannot be stored in a database.
a. Structured
b. Semi-Structured
c. Unstructured
d. None of these

9. ________ refers to the ability to make your data useful for business.
a. Velocity
b. Variety
c. Value
d. Volume

10. Value tells the trustworthiness of data in terms of quality and accuracy.
a. TRUE
b. FALSE

11. GFS consists of a ________ Master and ________ Chunk Servers.
a. Single, Single
b. Multiple, Single
c. Single, Multiple
d. Multiple, Multiple

12. Files are divided into ________ sized chunks.
a. Static
b. Dynamic
c. Fixed
d. Variable

13. ________ is an open source framework for storing data and running applications on
clusters of commodity hardware.
a. HDFS
b. Hadoop
c. MapReduce
d. Cloud

14. HDFS stores how much data in each cluster that can be scaled at any time?
a. 32
b. 64
c. 128
d. 256

15. Hadoop MapReduce allows you to perform distributed parallel processing on large
volumes of data quickly and efficiently (is this MapReduce or Hadoop?). Is this statement
True or False?
a. TRUE
b. FALSE

16. Hortonworks was introduced by Cloudera and owned by Yahoo.
a. TRUE
b. FALSE

17. Hadoop YARN is used for Cluster Resource Management in Hadoop Ecosystem.
a. TRUE
b. FALSE

18. Google introduced the MapReduce programming model in 2004.
a. TRUE
b. FALSE

19. ________ phase sorts the data & ________ creates logical clusters.
a. Reduce, YARN
b. MAP, YARN
c. REDUCE, MAP
d. MAP, REDUCE

20. There is only one operation between Mapping and Reducing. Is it True or False?
a. TRUE
b. FALSE

is factors considered before Adopting Big Data Technology.

28. ________ is a programming model for writing applications that can process Big
Data in parallel on multiple nodes.
a. HDFS
b. MAP REDUCE
c. HADOOP
d. HIVE

29. ________ takes the grouped key-value paired data as input and runs a
Reducer function on each one of them.
a. MAPPER
b. REDUCER
c. COMBINER
d. PARTITIONER

30. ________ is a type of local Reducer that groups similar data from the map phase
into identifiable sets.
a. MAPPER
b. REDUCER
c. COMBINER
d. PARTITIONER

31. While installing Hadoop, how many XML files are edited? List them.
i. core-site.xml
ii. hdfs-site.xml
iii. mapred-site.xml
iv. yarn-site.xml

32. Write the code for core-site.xml.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
</configuration>

33. Write the code for hdfs-site.xml.
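
The general shape of the core-site.xml and hdfs-site.xml files asked about above can be sketched with a small generator. This is a hedged illustration, not a reference answer: `build_site_xml` is a hypothetical helper, and the values `hdfs://localhost:9000` and a replication factor of 1 are assumptions for a single-machine, pseudo-distributed setup.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_site_xml(properties):
    """Render a Hadoop-style *-site.xml document from a dict of properties."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>',
        "<configuration>",
    ]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(str(value))}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumed values for a pseudo-distributed installation.
core_site = build_site_xml({"fs.defaultFS": "hdfs://localhost:9000"})
hdfs_site = build_site_xml({"dfs.replication": 1})

# Parse the result back to confirm it is well-formed XML.
core_root = ET.fromstring(core_site.encode("utf-8"))
hdfs_root = ET.fromstring(hdfs_site.encode("utf-8"))

print(core_site)
print(hdfs_site)
```

Each file is a single `<configuration>` element holding `<property>` entries with a `<name>` and a `<value>`; only the property names differ between the files.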

S. No. Objective Questions (MCQ / True or False / Fill up with Choices)

1. Movie Recommendation systems are an example of
1. Classification 2. Clustering 3. Reinforcement Learning 4. Regression
a. 2 Only
b. 1 and 2
c. 1 and 3
d. 2 and 3

2. Sentiment Analysis is an example of
1. Regression 2. Classification 3. Clustering 4. Reinforcement Learning
a. 1, 2 and 4
b. 1 and 3
c. 1, 2 and 3
d. 1 and 2

3. Can decision trees be used for performing clustering?
a. True
b. False

4. What is the minimum no. of variables/features required to perform clustering?
1. 0
2. 1
3. 2
4. 3

5. For two runs of K-Means clustering, is it expected to get the same clustering results?
1. Yes
2. No

6. Which of the following can act as possible termination conditions in K-Means?
1. For a fixed number of iterations.
2. Assignment of observations to clusters does not change between iterations, except for
cases with a bad local minimum.
3. Centroids do not change between successive iterations.
4. Terminate when RSS falls below a threshold.
a. 1, 3 and 4
b. 1, 2 and 3
c. 1, 2 and 4
d. All of the above
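
The stopping rules listed in question 6 can be seen side by side in a small K-Means loop. This is a minimal one-dimensional sketch in plain Python, assuming squared distance and hand-picked starting centroids; `kmeans_1d`, `max_iter` and `rss_threshold` are illustrative names, not part of any library.

```python
def kmeans_1d(points, centroids, max_iter=100, rss_threshold=0.0):
    """Lloyd's algorithm on 1-D data.

    Returns (centroids, assignments, stop_reason).
    """
    centroids = list(centroids)
    assignments = None
    for _ in range(max_iter):                       # rule 1: fixed number of iterations
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: (p - centroids[c]) ** 2)
            for p in points
        ]
        if new_assignments == assignments:          # rule 2: assignments unchanged
            return centroids, assignments, "assignments unchanged"
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:              # rule 3: centroids unchanged
            return centroids, assignments, "centroids unchanged"
        centroids = new_centroids

        # Residual sum of squares for the current clustering.
        rss = sum((p - centroids[a]) ** 2 for p, a in zip(points, assignments))
        if rss < rss_threshold:                     # rule 4: RSS below a threshold
            return centroids, assignments, "RSS below threshold"
    return centroids, assignments, "iteration limit reached"


points = [1.0, 2.0, 9.0, 10.0]
result_default = kmeans_1d(points, [0.0, 5.0])
result_rss = kmeans_1d(points, [0.0, 5.0], rss_threshold=2.0)
result_limit = kmeans_1d(points, [0.0, 5.0], max_iter=1)
print(result_default[2], "|", result_rss[2], "|", result_limit[2])
```

Which rule fires first depends on the data and the thresholds, so practical implementations usually combine an iteration cap with one of the convergence checks.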
7. Which of the following algorithms is most sensitive to outliers?
1. K-means clustering algorithm
2. K-medians clustering algorithm
3. K-modes clustering algorithm
4. K-medoids clustering algorithm

8. After performing K-Means Clustering analysis on a dataset, you observed the following
dendrogram. Which of the following conclusions can be drawn from the dendrogram?
a. There were 28 data points in clustering analysis
b. The best no. of clusters for the analyzed data points is 4
c. The proximity function used is Average-link clustering
d. The above dendrogram interpretation is not possible for K-Means clustering
analysis

In the figure below, if you draw a horizontal line on the y-axis for y=2, what will be the
number of clusters formed?

12. A Bayesian classifier is
1. A class of learning algorithm that tries to find an optimum classification of a set of
examples using the probabilistic theory.
2. Any mechanism employed by a learning system to constrain the search space of a
hypothesis
3. An approach to the design of learning algorithms that is inspired by the fact that when
people encounter new situations, they often explain them by reference to familiar
experiences, adapting the explanations to fit the new situation.
4. None of these

13. Classification accuracy is
1. A subdivision of a set of examples into a number of classes
2. A measure of the accuracy of the classification of a concept that is given by a
certain theory
3. The task of assigning a classification to a set of examples
4. None of these

14. Classification task refers to
1. A subdivision of a set of examples into a number of classes
2. A measure of the accuracy of the classification of a concept that is given by a
certain theory
3. The task of assigning a classification to a set of examples
4. None of these

15. Euclidean distance measure is
1. A stage of the KDD process in which new data is added to the existing selection.
2. The process of finding a solution for a problem simply by enumerating all possible
solutions according to some pre-defined order and then testing them
3. The distance between two points as calculated using the Pythagoras theorem
4. None of these
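
The Pythagorean definition of Euclidean distance in question 15 translates directly into code. A minimal sketch follows; the function name `euclidean_distance` and the sample points are chosen here for illustration only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))        # the classic 3-4-5 right triangle
print(euclidean_distance((1, 2, 3), (1, 2, 3)))  # identical points
```

The same formula extends to any number of features, which is why it is a common default proximity measure in clustering.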
16. ________ is good at handling missing data and supports both kinds of
attributes (i.e., categorical and continuous attributes).
a. ID3.
b. C4.5.
c. CART.
d. Naïve Bayes.

17. Decision trees use ________, in that they always choose the option
that seems the best available at that moment.
a. Greedy Algorithms.
b. Divide and Conquer.
c. Backtracking.
d. Shortest Path Method.

18. Decision trees cannot handle categorical attributes with many distinct values, such as
country codes for telephone numbers.
a. TRUE
b. FALSE

19. ________ are easy to implement and can execute efficiently even without
prior knowledge of the data; they are among the most popular algorithms for classifying
text documents.
a. ID3
b. Naïve Bayes classifiers
c. CART
d. None of these.

20. High entropy means that the partitions in classification are
a. Pure
b. Not pure
c. Useful
d. Useless

21. Which of the following statements about Naive Bayes is incorrect?
a. Attributes are equally important.
b. Attributes are statistically dependent on one another given the class value.
c. Attributes are statistically independent of one another given the class value.
d. Attributes can be nominal or numeric

The maximum value for entropy depends on the number of classes, so if we have 8 classes,
what will be the max entropy?
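
The dependence of maximum entropy on the number of classes can be checked numerically. The sketch below assumes Shannon entropy with base-2 logarithms (bits); the helper name `entropy` is illustrative.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Entropy is largest when all classes are equally likely.
uniform_8 = [1 / 8] * 8
max_entropy_8 = entropy(uniform_8)       # equals log2(8)

# A pure partition (one class only) has no uncertainty.
pure = entropy([1.0])

print(max_entropy_8, pure)
```

For k equally likely classes the entropy is log2(k), so the upper bound grows with the number of classes.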

26. Can we use K-Means clustering to identify the objects in video?
1. Yes
2. No

27. Clustering techniques are ________ in the sense that the data scientist
does not determine, in advance, the labels to apply to the clusters.
1. Unsupervised
2. Supervised
3. Reinforcement
4. Neural network

S. No. Objective Questions (MCQ / True or False / Fill up with Choices)
1. ________ metric is examined to determine a reasonably optimal value of k.
1. Mean Square Error
2. Within Sum of Squares (WSS)
3. Speed
4. None of These
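To illustrate how a within-sum-of-squares figure is obtained, the sketch below sums the squared distance from each point to its nearest centroid. The function name `wss`, the sample points, and the hand-picked centroids are assumptions made for illustration; a real analysis would take the centroids from a fitted k-means run.

```python
def wss(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - c) ** 2 for a, c in zip(p, centroid))
            for centroid in centroids
        )
    return total


points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
print(wss(points, [(4.75, 4.75)]))            # k = 1: 99.5
print(wss(points, [(1.0, 1.5), (8.5, 8.0)]))  # k = 2: 1.0
```

Plotting this value against k and looking for the point where the decrease flattens is the usual way such a metric is examined.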
2. If an itemset is considered frequent, then any subset of the frequent itemset must also be frequent.
1. Apriori Property
2. Downward Closure Property
3. Either 1 or 2
4. Both 1 & 2
3. If {bread,eggs,milk} has a support of 0.15 and {bread,eggs} also has a support of 0.15, the confidence of rule {bread,eggs}→{milk} is
1. 0
2. 1
3. 2
4. 3
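The confidence of a rule X→Y is the support of X and Y together divided by the support of X. A minimal sketch with the numbers from the question above follows; the function name `confidence` is illustrative.

```python
def confidence(support_xy, support_x):
    """Confidence(X -> Y) = Support(X and Y) / Support(X)."""
    if support_x <= 0:
        raise ValueError("support of the antecedent must be positive")
    return support_xy / support_x


# {bread,eggs,milk} has support 0.15 and {bread,eggs} has support 0.15.
print(confidence(0.15, 0.15))  # 1.0
```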
4. Confidence is a measure of how X and Y are really related rather than coincidentally happening together.
a. True
b. False
5. A high-confidence rule can sometimes be misleading because confidence does not consider support of the itemset in the rule consequent. Is this true?
a. Yes
b. No
6. ________ recommend items based on similarity measures between users and/or items.
1. Content Based Systems
2. Hybrid System
3. Collaborative Filtering Systems
4. None of These

7. There are ________ major classifications of collaborative filtering mechanisms
1. 1
2. 2
3. 3
4. None of These
8. Movie recommendation to people is an example of
1. User Based Recommendation
2. Item Based Recommendation
3. Knowledge Based Recommendation
4. Content Based Recommendation
9. ________ recommenders rely on an explicitly defined set of recommendation rules.
1. Constraint Based
2. Case Based
3. Content Based
4. User Based
10. Parallelized hybrid recommender systems operate dependently of one another and produce separate recommendation lists.
1. True
2. False
11. Association rules are sometimes referred to as
a. market basket analysis
b. Itemset Filtering
c. Frequent Itemset Analysis
d. None of these.
12. If 80% of all transactions contain itemset {bread}, then the support of {bread} is 0.8. Similarly, if 60% of all transactions contain itemset {bread,butter}, then the support of {bread,butter} is
a. 0.4
b. 0.5
c. 0.6
d. 0.7
13. Lift is defined as the measure of certainty or trustworthiness associated with each discovered rule.
a. TRUE
b. FALSE
14. ________ is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental.
a. Lift
b. Confidence
c. Support
d. Leverage

15. ________ recommend items based on similarity measures between users and/or items. The items recommended to a user are those preferred by similar users.
a. Collaborative Filtering System
b. Content Based Recommendation
c. Knowledge Based Recommendation
d. Hybrid Approaches
16. Pure collaborative approaches take a matrix of given user–item ratings as the only input and typically produce output. Is it Pure Collaborative?
a. Yes
b. No
17. With respect to the determination of the set of similar users, one common measure used in recommender systems is
a. Cosine Similarity Measure
b. Pearson’s correlation coefficient.
c. Mean Squared Error Method
d. None of these.
18. Large-scale e-commerce sites often implement a different technique, ________, which is more apt for offline preprocessing and thus allows for the computation of recommendations in real time even for a very large rating matrix.
a. Item-Based Recommendation
b. User-Based Recommendation
c. Content-Based Recommendation
d. None of these
19. Here are two very short texts. Compare them and find the cosine similarity measure.
I. Julie loves me more than Linda loves me
II. Jane likes me more than Julie loves me
a. 0.6
b. 0.7
c. 0.8
d. 0.9
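The cosine similarity of the two texts can be worked out from their word-count vectors. The sketch below assumes whitespace tokenisation with case folding and raw term counts (no stop-word removal or TF-IDF weighting), which the question does not specify; the function name `cosine_similarity` is illustrative.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Words missing from counts_b count as 0, so only shared words contribute.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b)


text_1 = "Julie loves me more than Linda loves me"
text_2 = "Jane likes me more than Julie loves me"
print(round(cosine_similarity(text_1, text_2), 3))  # 0.822
```

Here the dot product is 9 and the vector norms are √12 and √10, giving 9/√120 ≈ 0.822, which rounds to 0.8.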
20. ________ is based on the availability of item descriptions and a profile that assigns importance to these characteristics.
a. Item-Based Recommendation
b. User-Based Recommendation
c. Content-Based Recommendation.
d. None of these
21. Consider the features of a movie which are not relevant to a recommendation system.
a. The set of actors of the movie.
b. The Director
c. The Year in which the movie was made
d. The Budget of the movie.

22. A ________ has been implemented for similarity-based retrieval under nearest neighbors.
a. k-nearest-neighbor method (kNN)
b. Conventional Neural Network (CNN)
c. Bayes Theorem
d. Naïve Bayes Classifier
23. Case-based recommenders focus on the retrieval of similar items on the basis of different types of similarity measures
a. TRUE
b. FALSE
24. In ________ recommendation approaches, items are retrieved using similarity measures that describe to which extent item properties match some given user’s requirements.
a. Item-Based
b. Case-Based
c. Content-Based
d. User-Based
25. ________ are based on a sequenced order of techniques, in which each succeeding recommender only refines the recommendations of its predecessor.
a. Weighted Hybrids
b. Mixed Hybrids
c. Cascade Hybrids
d. Switching Hybrids
26. ________ require an oracle that decides which recommender should be used in a specific situation, depending on the user profile and/or the quality of recommendation
a. Weighted Hybrids
b. Mixed Hybrids
c. Cascade Hybrids
d. Switching Hybrids

No. Question a b c d ANS
Eg. Write down question Option a Option b Option c Option d a/b/c/d
1. Business intelligence (BI) is a broad category of application programs which includes _____________
a) Decision support
b) Data mining
c) OLAP
d) All of the mentioned
ANS: d
2. BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
ANS: d
3. Which of the following areas are affected by BI?
a) Revenue
b) CRM
c) Sales
d) All of the mentioned
ANS: b
4. ________ is a performance management tool that recapitulates an organization’s performance from several standpoints on a single page.
a) Balanced Scorecard
b) Data Cube
c) Dashboard
d) All of the mentioned
ANS: a
5. __________ is a system where operations like data extraction, transformation and loading operations are executed.
a) Data staging
b) Data integration
c) ETL
d) None of the mentioned
ANS: a
6. _________ is a category of applications and technologies for presenting and analyzing corporate and external data.
a) Data warehouse
b) MIS
c) EIS
d) All of the mentioned
ANS: c
7. Which of the following is the process of basing an organization’s actions and decisions on actual measured results of performance?
a) Institutional performance management
b) Gap analysis
c) Slice and Dice
d) None of the mentioned
ANS: a
8. Which of the following does not form part of BI Stack in SQL Server?
a) SSRS
b) SSIS
c) SSAS
d) OBIEE
ANS: d
9. BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
ANS: d
10. This is an approach to selling goods and services in which a prospect explicitly agrees in advance to receive marketing information.
a) customer managed relationship
b) data mining
c) permission marketing
d) one-to-one marketing
ANS: c
11. In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences.
a) Web services
b) customer-facing
c) client/server
d) personalization
ANS: d
12. This is the processing of data about customers and their relationship with the enterprise in order to improve the enterprise’s future sales and service and lower cost.
a) clickstream analysis
b) database marketing
c) customer relationship management
d) CRM analytics
ANS: d
13. This is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions.
a) best practice
b) data mart
c) business information warehouse
d) business intelligence
ANS: d

14. This is a systematic approach to the gathering, consolidation, and processing of consumer data (both for customers and potential customers) that is maintained in a company’s databases.
a) database marketing
b) marketing encyclopedia
c) application integration
d) service oriented integration
ANS: a
15. This is an arrangement in which a company outsources some or all of its customer relationship management functions to an application service provider (ASP).
a) spend management
b) supplier relationship management
c) hosted CRM
d) Customer Information Control System
ANS: c
16. This is an XML-based metalanguage developed by the Business Process Management Initiative (BPMI) as a means of modeling business processes, much as XML is, itself, a metalanguage with the ability to model enterprise data.
a) BizTalk
b) BPML
c) e-biz
d) ebXML
ANS: b
17. This is a central point in an enterprise from which all customer contacts are managed.
a) contact center
b) help system
c) multichannel marketing
d) call center
ANS: a
18. This is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests, spending habits, and so on.
a) customer service chat
b) customer managed relationship
c) customer life cycle
d) customer segmentation
ANS: d
19. In data mining, this is a technique used to predict future behavior and anticipate the consequences of change.
a) predictive technology
b) disaster recovery
c) phase change
d) predictive modeling
ANS: d
20. According to analysts, for what can traditional IT systems provide a foundation when they’re integrated with big data technologies like Hadoop?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
ANS: a
21. All of the following accurately describe Hadoop, EXCEPT:
a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach
ANS: b
22. __________ has the world’s largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
ANS: c
23. What are the five V’s of Big Data?
a) Volume
b) velocity
c) Variety
d) All of the above
ANS: d
24. _________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading.
a) Scalding
b) Cascalog
c) Hcatalog
d) Hcalding
ANS: b
25. What are the main components of Big Data?
a) MapReduce
b) HDFS
c) YARN
d) All of these
ANS: d
26. What are the different features of Big Data Analytics?
a) Open-Source
b) Scalability
c) Data Recovery
d) All the above
ANS: d
27. Define the Port Numbers for NameNode, Task Tracker and Job Tracker.
a) NameNode
b) Task Tracker
c) Job Tracker
d) All of the above
ANS: d
28. Facebook Tackles Big Data With _______ based on Hadoop
a) Project Prism
b) Prism
c) ProjectData
d) ProjectBid
ANS: a
29. What is a unit of data that flows through a Flume agent?
a) Record
b) Event
c) Row
d) Log
ANS: b

30. A feature F1 can take certain value: A, B, C, D, E, & F and represents grade of students from a college. Which of the following statement is true in the following case
a) Feature F1 is an example of nominal variable.
b) Feature F1 is an example of ordinal variable.
c) It doesn’t belong to any of the above category.
d) Both of these
ANS: b
31. Which of the following is an example of a deterministic algorithm?
a) PCA
b) K-Means
c) None of the above
d) all of the above
ANS: a
32. What is the entropy of the target variable?
a) -(5/8 log(5/8) + 3/8 log(3/8))
b) 5/8 log(5/8) + 3/8 log(3/8)
c) 5/8 log(5/8) + 3/8 log(3/8)
d) 5/8 log(3/8) – 3/8 log(5/8)
ANS: a
33. Point out the correct statement.
a) OLAP is an umbrella term that refers to an assortment of software applications for analyzing an organization’s raw data for intelligent decision making
b) Business intelligence equips enterprises to gain business advantage from data
c) BI makes an organization agile thereby giving it a lower edge in today’s evolving market condition
d) None of the mentioned
ANS: b
34. BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
ANS: d
35. Which of the following areas are affected by BI?
a) Revenue
b) CRM
c) Sales
d) All of the mentioned
ANS: b
36. Which of the following does not form part of BI Stack in SQL Server?
a) SSRS
b) SSIS
c) SSAS
d) OBIEE
ANS: d
37. BI can catalyze a business’s success in terms of _____________
a) Distinguish the products and services that drive revenues
b) Rank customers and locations based on profitability
c) Ranks customers and locations based on probability
d) All of the mentioned
ANS: d
38. Heuristic is
a) A set of databases from different vendors, possibly using different database paradigms
b) An approach to a problem that is not guaranteed to work but performs well in most cases
c) Information that is hidden in a database and that cannot be recovered by a simple SQL query.
d) None of these
ANS: b
39. In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences.
a) Web services
b) customer-facing
c) client/server
d) personalization
ANS: d

40. Heterogeneous databases referred to
a) A set of databases from different vendors, possibly using different database paradigms
b) An approach to a problem that is not guaranteed to work but performs well in most cases.
c) Information that is hidden in a database and that cannot be recovered by a simple SQL query.
d) None of these
ANS: a
No. Question a b c d ANS
Eg. Write down question Option a Option b Option c Option d a/b/c/d
1. Movie Recommendation systems are an example of:
a) Classification
b) Clustering
c) Reinforcement Learning
d) Regression
ANS: b and c
2. Sentiment Analysis is an example of:
a) Regression
b) Classification
c) Clustering
d) Reinforcement Learning
ANS: a, b and d
3. What is the minimum no. of variables/features required to perform clustering?
a) 0
b) 1
c) 2
d) 3
ANS: b
4. Is it possible that Assignment of observations to clusters does not change between successive iterations in K-Means
a) Yes
b) No
c) Can’t say
d) None of these
ANS: a
5. Which of the following can act as possible termination conditions in K-Means?
a) For a fixed number of iterations.
b) Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum.
c) Centroids do not change between successive iterations.
d) Terminate when RSS falls below a threshold.
ANS: a, b, c, d
6. Which of the following clustering algorithms suffers from the problem of convergence at local optima?
a) K-Means clustering algorithm
b) Agglomerative clustering algorithm
c) Expectation-Maximization clustering algorithm
d) Diverse clustering algorithm
ANS: a and c
7. Which of the following algorithm is most sensitive to outliers?
a) K-means clustering algorithm
b) K-medians clustering algorithm
c) K-modes clustering algorithm
d) K-medoids clustering algorithm
ANS: a
8. How can Clustering (Unsupervised Learning) be used to improve the accuracy of Linear Regression model (Supervised Learning):
a) Creating different models for different cluster groups.
b) Creating an input feature for cluster ids as an ordinal variable.
c) Creating an input feature for cluster centroids as a continuous variable.
d) Creating an input feature for cluster size as a continuous variable.
ANS: a, b, c, d
9. What could be the possible reason(s) for producing two different dendrograms using agglomerative clustering algorithm for the same dataset?
a) Proximity function used
b) of data points used
c) of variables used
d) All of the above
ANS: d
10. In which of the following cases will K-Means clustering fail to give good results?
a) Data points with outliers
b) Data points with different densities
c) Data points with round shapes
d) Data points with non-convex shapes
ANS: a, b and d
11. Which of the following is/are valid iterative strategy for treating missing values before clustering analysis?
a) Imputation with mean
b) Nearest Neighbor assignment
c) Imputation with Expectation Maximization algorithm
d) All of the above
ANS: c

12. Feature scaling is an important step before applying K-Mean algorithm. What is reason behind this?
a) In distance calculation it will give the same weights for all features
b) You always get the same clusters. If you use or don’t use feature scaling
c) In Manhattan distance it is an important step but in Euclidean it is not
d) None of these
ANS: a
13. Which of the following method is used for finding optimal number of clusters in K-Mean algorithm?
a) Elbow method
b) Manhattan method
c) Euclidean method
d) All of the above
ANS: a
14. What is true about K-Mean Clustering?
a) K-means is extremely sensitive to cluster center initializations
b) Bad initialization can lead to Poor convergence speed
c) Bad initialization can lead to bad overall clustering
d) None of these
ANS: d
15. Which of the following can be applied to get good results for K-means algorithm corresponding to global minima?
a) Try to run algorithm for different centroid initialization
b) Adjust number of iterations
c) Find out the optimal number of clusters
d) None of these
ANS: a, b, c
16. If you are using Multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important:
a) All the data points follow two Gaussian distribution
b) All the data points follow n Gaussian distribution (n >2)
c) All the data points follow two multinomial distribution
d) All the data points follow n multinomial distribution (n >2)
ANS: c
17. Which of the following is/are not true about Centroid based K-Means clustering algorithm and Distribution based expectation-maximization clustering algorithm:
a) Both starts with random initializations
b) Both are iterative algorithms
c) Both have strong assumptions that the data points must fulfill
d) Expectation maximization algorithm is a special case of K-Means
ANS: d
18. Which of the following is/are not true about DBSCAN clustering algorithm:
a) For data points to be in a cluster, they must be in a distance threshold to a core point
b) It has strong assumptions for the distribution of data points in dataspace
c) It has substantially high time complexity of order O(n^3)
d) It does not require prior knowledge of the no. of desired clusters
ANS: b and c
19. Which of the following are the high and low bounds for the existence of F-Score?
a) [0,1]
b) (0,1)
c) [-1,1]
d) None of the above
ANS: a
20. All of the following increase the width of a confidence interval except:
a) Increased confidence level
b) Increased variability
c) Increased sample size
d) Decreased sample size
ANS: c
21. The p-value in hypothesis testing represents which of the following: Please select the best answer of those provided below.
a) The probability of failing to reject the null hypothesis, given the observed results
b) The probability that the null hypothesis is true, given the observed results
c) The probability that the observed results are statistically significant, given that the null hypothesis is true
d) The probability of observing results as extreme or more extreme than currently observed, given that the null hypothesis is true
ANS: d

22. Assume that the difference between the observed, paired sample values is defined in the same manner and that the specified significance level is the same for both hypothesis tests. Using the same data, the statement that “a paired/dependent two sample t-test is equivalent to a one sample t-test on the paired differences, resulting in the same test statistic, same p-value, and same conclusion” is: Please select the best answer of those provided below.
a) Always True
b) Never True
c) Sometimes True
d) Not Enough Information
ANS: a
23. Green sea turtles have normally distributed weights, measured in kilograms, with a mean of 134.5 and a variance of 49.0. A particular green sea turtle’s weight has a z-score of -2.4. What is the weight of this green sea turtle? Round to the nearest whole number.
a) 17 kg
b) 151 kg
c) 118 kg
d) 252 kg
ANS: c
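The green sea turtle weight follows from inverting the z-score formula, x = mean + z × standard deviation, where the standard deviation is the square root of the variance. A small sketch with the numbers from the question follows; the function name `value_from_z` is illustrative.

```python
import math


def value_from_z(mean, variance, z):
    """Recover a raw value from its z-score: x = mean + z * sqrt(variance)."""
    return mean + z * math.sqrt(variance)


weight = value_from_z(mean=134.5, variance=49.0, z=-2.4)
print(round(weight))  # 118
```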
24. What percentage of measurements in a dataset fall above the median?
a) 49%
b) 50%
c) 51%
d) Cannot Be Determined
ANS: d
25. The proportion of variation in 5k race times that can be explained by the variation in the age of competitive male runners was approximately 0.663. What is the value of the sample linear correlation coefficient? Round to 3 decimal places.
a) 0.663
b) 0.814
c) -0.814
d) 0.440
ANS: c
26. Using all of the results provided, is it reasonable to predict the 5k race time (minutes) of a competitive male runner 73 years of age?
a) Yes; linear correlation between age and 5k race times is statistically significant
b) Yes; both the sample linear regression equation and an age in years is provided
c) No; linear correlation between age and 5k race times is not statistically significant
d) No; the age provided is beyond the scope of our available sample data
ANS: d
27. Algorithm is
a) It uses machine-learning techniques. Here program can learn from past experience and adapt themselves to new situations
b) Computational procedure that takes some value as input and produces some value as output
c) Science of making machines performs tasks that would require intelligence when performed by humans
d) None of these
ANS: b

28. Bias is
a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory
b) Any mechanism employed by a learning system to constrain the search space of a hypothesis
c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation.
d) None of these
ANS: b
29. Classification is
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
ANS: a
30. Binary attribute are
a) This takes only two values. In general, these values will be 0 and 1 and they can be coded as one bit
b) The natural environment of a certain species
c) Systems that can be used without knowledge of internal operations
d) None of these
ANS: a
31. Classification accuracy is
a) A subdivision of a set of examples into a number of classes
b) Measure of the accuracy of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
ANS: b
32. Cluster is
a) Group of similar objects that differ significantly from other objects
b) Operations on a database to transform or simplify data in order to prepare it for a machine-learning algorithm
c) Symbolic representation of facts or ideas from which information can potentially be extracted
d) None of these
ANS: a
33. A definition of a concept is ----- if it recognizes all the instances of that concept
a) Complete
b) Consistent
c) Constant
d) None of these
ANS: a
34. A definition or a concept is ------------- if it classifies any examples as coming within the concept
a) Complete
b) Consistent
c) Constant
d) None of these
ANS: b

35. Data selection is
a) The actual discovery phase of a knowledge discovery process
b) The stage of selecting the right data for a KDD process
c) A subject-oriented integrated time variant non-volatile collection of data in support of management
d) None of these
ANS: b
36. Classification task referred to
a) A subdivision of a set of examples into a number of classes
b) A measure of the accuracy of the classification of a concept that is given by a certain theory
c) The task of assigning a classification to a set of examples
d) None of these
ANS: c
37. Hybrid is
a) Combining different types of method or information
b) Approach to the design of learning algorithms that is structured along the lines of the theory of evolution.
c) Decision support systems that contain an information base filled with the knowledge of an expert formulated in terms of if-then rules.
d) None of these
ANS: a
38. Discovery is
a) It is hidden within a database and can only be recovered if one is given certain clues (an example IS encrypted information).
b) The process of executing implicit previously unknown and potentially useful information from data
c) An extremely complex molecule that occurs in human chromosomes and that carries genetic information in the form of genes.
d) None of these
ANS: b
39. What could be the possible reason(s) for producing two different dendrograms using agglomerative clustering algorithm for the same dataset?
a) Proximity function used
b) of data points used
c) of variables used
d) All of the above
ANS: d
40. Is it possible that Assignment of observations to clusters does not change between successive iterations in K-Means
a) Yes
b) No
c) Can’t say
d) None of these
ANS: a

No. Question a b c d ANS
Eg. Write down question Option a Option b Option c Option d a/b/c/d
1. This clustering algorithm terminates when mean values computed for the current iteration of the algorithm are identical to the computed mean values for the previous iteration
a) K-Means clustering
b) conceptual clustering
c) expectation maximization
d) agglomerative clustering
ANS: a
2. The correlation coefficient for two real-valued attributes is –0.85. What does this value tell you?
a) The attributes are not linearly related.
b) As the value of one attribute decreases the value of the second attribute increases.
c) As the value of one attribute increases the value of the second attribute also increases.
d) The attributes show a linear relationship
ANS: b

Y is false
Given a rule of the form IF X THEN Y, rule Y is true when X is true when X is false when
when X is
3 confidence is defined as the conditional X is known to Y is known to Y is known to b
known to be
probability that be true. be true be false.
false.
4. Chameleon is
a) Density based clustering algorithm
b) Partitioning based algorithm
c) Model based algorithm
d) Hierarchical clustering algorithm
ANS: d
5. Find odd man out
a) DBSCAN
b) K-Mean
c) PAM
d) None of above
ANS: a
6. The number of iterations in apriori ___________
a) increases with the size of the data
b) decreases with the increase in size of the data
c) increases with the size of the maximum frequent set
d) decreases with increase in size of the maximum frequent set
ANS: c
7. Which of the following are interestingness measures for association rules?
a) Recall
b) Lift
c) Accuracy
d) All of Above
ANS: b
8. Given a frequent itemset L, If |L| = k, then there are
a) 2^k – 1 candidate association rules
b) 2^k candidate association rules
c) 2^k – 2 candidate association rules
d) 2^(k-2) candidate association rules
ANS: c
9. _________ is an example for case based-learning
a) Decision trees
b) Neural networks
c) Genetic algorithm
d) K-nearest neighbor
ANS: d
10. The average positive difference between computed and desired outcome values.
a) mean positive error
b) mean squared error
c) mean absolute error
d) root mean squared error
ANS: c
11. Frequent item sets is
a) Superset of only closed frequent item sets
b) Superset of only maximal frequent item sets
c) Subset of maximal frequent item sets
d) Superset of both closed frequent item sets and maximal frequent item sets
ANS: d
12. Assume that we have a dataset containing information about 200 individuals. A supervised data mining session has discovered the following rule: IF age < 30 & credit card insurance = yes THEN life insurance = yes. Rule Accuracy: 70% and Rule Coverage: 63%. How many individuals in the class life insurance = no have credit card insurance and are less than 30 years old?
a) 63
b) 38
c) 40
d) 89
ANS: b
13. Which of the following is cluster analysis?
a) Simple segmentation
b) Grouping similar objects
c) Labeled classification
d) Query results grouping
ANS: b
14. A good clustering method will produce high quality clusters with
a) high inter class similarity
b) low intra class similarity
c) high intra class similarity
d) None of above
ANS: c
15. Which two parameters are needed for DBSCAN
a) Min threshold
b) Min points and eps
c) Min sup and min confidence
d) Number of centroids
ANS: b
16. Which statement is true about neural network and linear regression models?
a) Both techniques build models whose output is determined by a linear sum of weighted input attribute values.
b) The output of both models is a categorical attribute value.
c) Both models require numeric attributes to range between 0 and 1.
d) Both models require input attributes to be numeric.
ANS: d

17. In Apriori algorithm, if 1 item-sets are 100, then the number of candidate 2 item-sets are
a) 100
b) 200
c) 4950
d) 5000
ANS: c
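The count of candidate 2-itemsets is the number of unordered pairs that can be formed from the frequent 1-itemsets, that is, "n choose 2". A minimal sketch follows; the function name `candidate_pairs` is illustrative, and `math.comb` requires Python 3.8 or later.

```python
import math


def candidate_pairs(num_frequent_items):
    """Number of candidate 2-itemsets built from n frequent 1-itemsets: n * (n - 1) / 2."""
    return math.comb(num_frequent_items, 2)


print(candidate_pairs(100))  # 4950
```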
18. Significant Bottleneck in the Apriori algorithm is
a) Finding frequent itemsets
b) Pruning
c) Candidate generation
d) Number of iterations
ANS: c
typically
are better able
Machine learning techniques differ from assume an have trouble are not able to
to deal with
19 statistical techniques in that machine underlying with large-sized explain their a
missing and
learning methods distribution for datasets behavior.
noisy data
the data
20. The probability of a hypothesis before the presentation of evidence.
a) a priori
b) posterior
c) conditional
d) subjective
ANS: a
21. KDD represents extraction of
a) data
b) knowledge
c) rules
d) model
ANS: b
Outliers
Outliers should
should be part The nature of
Outliers should be part of the
of the training the problem
be identified test dataset but
22 Which statement about outliers is true? dataset but determines how c
and removed should not be
should not be outliers are
from a dataset. present in the
present in the used
training data.
test data.
23. The most general form of distance is
a) Manhattan
b) Euclidean
c) Mean
d) Minkowski
ANS: d
24. Which Association Rule would you prefer
a) High support and medium confidence
b) High support and low confidence
c) Low support and high confidence
d) Low support and low confidence
ANS: c
25. In a rule-based classifier, if there is a rule for each combination of attribute values, what do you call that rule set R?
a) Exhaustive
b) Inclusive
c) Comprehensive
d) Mutually exclusive
ANS: a
26. The apriori property means
a) If a set cannot pass a test, its supersets will also fail the same test
b) To decrease the efficiency, do level-wise generation of frequent item sets
c) To improve the efficiency, do level-wise generation of frequent item sets
d) If a set can pass a test, its supersets will fail the same test
ANS: a
27. If an item set ‘XYZ’ is a frequent item set, then all subsets of that frequent item set are
a) Undefined
b) Not frequent
c) Frequent
d) Can not say
ANS: c
28. The probability that a person owns a sports car given that they subscribe to automotive magazine is 40%. We also know that 3% of the adult population subscribes to automotive magazine. The probability of a person owning a sports car given that they don’t subscribe to automotive magazine is 30%. Use this information to compute the probability that a person subscribes to automotive magazine given that they own a sports car
a) 0.0368
b) 0.0396
c) 0.0389
d) 0.0398
ANS: b
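The sports-car question is a direct application of Bayes' theorem: P(subscribes | car) = P(car | subscribes) · P(subscribes) / P(car), with P(car) expanded over subscribers and non-subscribers. A small sketch with the numbers from the question follows; the function and parameter names are illustrative.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes' theorem for a binary hypothesis H and evidence E.

    prior                = P(H)
    likelihood           = P(E | H)
    likelihood_given_not = P(E | not H)
    """
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence


# H: subscribes to the magazine, E: owns a sports car.
p = posterior(prior=0.03, likelihood=0.40, likelihood_given_not=0.30)
print(round(p, 4))  # 0.0396
```

The numerator is 0.40 × 0.03 = 0.012 and the denominator is 0.012 + 0.30 × 0.97 = 0.303, so the result is 0.012 / 0.303 ≈ 0.0396.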
29. Simple regression assumes a __________ relationship between the input attribute and output attribute.
a) quadratic
b) inverse
c) linear
d) reciprocal
ANS: c
30. To determine association rules from frequent item sets
a) Only minimum confidence needed
b) Neither support not confidence needed
c) Both minimum support and confidence are needed
d) Minimum support is needed
ANS: c
31. If {A,B,C,D} is a frequent itemset, candidate rules which is not possible is
a) C –> A
b) D –> ABCD
c) A –> BC
d) B –> ADC
ANS: b
32. Which Association Rule would you prefer
a) High support and low confidence
b) Low support and high confidence
c) Low support and low confidence
d) High support and medium confidence
ANS: b
33 Classification rules are extracted from _____________ a) decision tree b) root node c) branches d) siblings Ans: a
34 What does K refer to in the K-Means algorithm, which is a non-hierarchical clustering approach? a) Complexity b) Fixed value c) No of iterations d) number of clusters Ans: d
35 If a Linear regression model fits perfectly, i.e., train error is zero, then _____________________ a) Test error is also always zero b) Test error is non zero c) Couldn't comment on Test error d) Test error is equal to Train error Ans: c
36 Which of the following metrics can be used for evaluating regression models? i) R Squared ii) Adjusted R Squared iii) F Statistics iv) RMSE/MSE/MAE a) ii and iv b) i and ii c) ii, iii and iv d) i, ii, iii and iv Ans: d
37 How many coefficients do you need to estimate in a simple linear regression model (One independent variable)? a) 1 b) 2 c) 3 d) 4 Ans: b
38 In a simple linear regression model (One independent variable), if we change the input variable by 1 unit, how much will the output variable change? a) by 1 b) no change c) by intercept d) by its slope Ans: d
39 In syntax of linear model lm(formula,data,..), data refers to ______ a) Matrix b) array c) vector d) list Ans: c
40 In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to __________ a) (X-intercept, Slope) b) (Slope, X-Intercept) c) (Y-Intercept, Slope) d) (slope, Y-Intercept) Ans: c
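Questions 37, 38 and 40 all rest on the same model, Y = β1 + β2X + ϵ. A minimal sketch, assuming ordinary least squares and made-up data (the helper name `fit_simple_linear` is hypothetical), shows the two estimated coefficients and that a one-unit change in X moves the prediction by the slope.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for Y = beta1 + beta2 * X (one input variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta2 = sxy / sxx                  # slope
    beta1 = mean_y - beta2 * mean_x    # Y-intercept
    return beta1, beta2

# Made-up data lying exactly on y = 3 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 7.0, 9.0, 11.0]
beta1, beta2 = fit_simple_linear(xs, ys)

def predict(x):
    return beta1 + beta2 * x

# Raising the input by one unit changes the output by the slope.
change = predict(6.0) - predict(5.0)
print(beta1, beta2, change)
```

Only two numbers are estimated (intercept and slope), which matches the answer to question 37.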

No. Question a b c d ANS
Eg. Write down question Option a Option b Option c Option d a/b/c/d
1 A _________ is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. a) Decision tree b) Graphs c) Trees d) Neural Networks Ans: a
2 Decision Tree is a display of an algorithm. a) TRUE b) FALSE Ans: a
3 What is Decision Tree? a) Flow-Chart b) Structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label c) Flow-Chart & Structure in which internal node represents test on an attribute, each branch represents outcome of test and each leaf node represents class label d) None of Above Ans: c

4 Decision Trees can be used for Classification Tasks. a) TRUE b) FALSE Ans: a
5 Choose from the following that are Decision Tree nodes? a) Decision Nodes b) End Nodes c) Chance Nodes d) All of Above Ans: d
6 Decision Nodes are represented by ____________ a) Disks b) Squares c) Circles d) Triangles Ans: b
7 Chance Nodes are represented by __________ a) Disks b) Squares c) Circles d) Triangles Ans: c
8 End Nodes are represented by __________ a) Disks b) Squares c) Circles d) Triangles Ans: d

9 Which of the following are the advantage/s of Decision Trees? a) Possible Scenarios can be added b) Use a white box model, If given result is provided by a model c) Worst, best and expected values can be determined for different scenarios d) All of Above Ans: d
10 Which of the following statements about Naive Bayes is incorrect? a) Attributes are equally important. b) Attributes are statistically dependent of one another given the class value. c) Attributes are statistically independent of one another given the class value. d) Attributes can be nominal or numeric Ans: b
11 Which of the following is not supervised learning? a) Clustering b) Decision Tree c) Linear Regression d) Naive Bayesian Ans: a
12 How many terms are required for building a Bayes model? a) 1 b) 2 c) 3 d) 4 Ans: c
13 Where can the Bayes rule be used? a) Solving queries b) Increasing complexity c) Decreasing complexity d) Answering probabilistic query Ans: d
14 How can the Bayesian network be used to answer any query? a) Full distribution b) Joint distribution c) Partial distribution d) All of Above Ans: b
15 What is the consequence between a node and its predecessors while creating a Bayesian network? a) Functionally dependent b) Dependant c) Conditionally independent d) Both Conditionally dependant & Dependant Ans: c

16 Bayesian classifiers is a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory. b) Any mechanism employed by a learning system to constrain the search space of a hypothesis c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation. d) None of these Ans: a

17 Bias is a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory b) Any mechanism employed by a learning system to constrain the search space of a hypothesis c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation. d) None of these Ans: b

18 Background knowledge referred to a) Additional acquaintance used by a learning algorithm to facilitate the learning process b) A neural network that makes use of a hidden layer c) It is a form of automatic learning. d) None of these Ans: a
19 Classification accuracy is a) A subdivision of a set of examples into a number of classes b) A measure of the accuracy, of the classification of a concept that is given by a certain theory c) The task of assigning a classification to a set of examples d) None of these Ans: b
20 Classification is a) A subdivision of a set of examples into a number of classes b) A measure of the accuracy, of the classification of a concept that is given by a certain theory c) The task of assigning a classification to a set of examples d) None of these Ans: a
21 Discovery is a) It is hidden within a database and can only be recovered if one is given certain clues (an example IS encrypted information). b) The process of executing implicit previously unknown and potentially useful information from data c) An extremely complex molecule that occurs in human chromosomes and that carries genetic information in the form of genes. d) None of these Ans: b
22 Classification task referred to a) A subdivision of a set of examples into a number of classes b) A measure of the accuracy, of the classification of a concept that is given by a certain theory c) The task of assigning a classification to a set of examples d) None of these Ans: c

23 Euclidean distance measure is a) A stage of the KDD process in which new data is added to the existing selection. b) The process of finding a solution for a problem simply by enumerating all possible solutions according to some pre-defined order and then testing them c) The distance between two points as calculated using the Pythagoras theorem d) None of these Ans: c
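The definition in question 23 translates directly into code. This is a small sketch of the Pythagorean calculation for points with any number of coordinates; the function name and the sample points are chosen only for illustration.

```python
import math

def euclidean_distance(p, q):
    """Square root of the summed squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# A 3-4-5 right triangle: the distance is the hypotenuse.
d = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(d)
```

The standard library function `math.dist` (Python 3.8 and later) computes the same quantity.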
24 The problem of finding hidden structure in unlabeled data is called a) Supervised learning b) Unsupervised learning c) Reinforcement learning d) None of these Ans: b
25 Assume you want to perform supervised learning and to predict number of newborns according to size of storks' population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of a) Classification b) Regression c) Clustering d) Structural equation modeling Ans: b
26 Discriminating between spam and ham e-mails is a classification task, true or false? a) TRUE b) FALSE Ans: a
27 Which of the following is not involved in data mining? a) Knowledge extraction b) Data archaeology c) Data exploration d) Data transformation Ans: d
28 Naive prediction is a) A class of learning algorithms that try to derive a Prolog program from examples b) A table with n independent attributes can be seen as an n-dimensional space. c) A prediction made using an extremely simple method, such as always predicting the same output. d) None of these Ans: c
29 Node is a) A component of a network b) In the context of KDD and data mining, this refers to random errors in a database table. c) One of the defining aspects of a data warehouse d) None of these Ans: a
30 Prediction is a) The result of the application of a theory or a rule in a specific case b) One of several possible enters within a database table that is chosen by the designer as the primary means of accessing the data in the table. c) Discipline in statistics that studies ways to find the most interesting projections of multi-dimensional spaces. d) None of these Ans: a
31 What is the relation between the distance between clusters and the corresponding class discriminability? a) proportional b) inversely-proportional c) no-relation d) None of these Ans: a
32 The classification method in which the upper limit of an interval is the same as the lower limit of the next class interval is called…. a) exclusive method b) inclusive method c) mid point method d) None of these Ans: a
33 The largest value is 60, the smallest value is 40 and the number of classes is 5; then the class interval is a) 20 b) 25 c) 4 d) 15 Ans: c
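The arithmetic behind question 33 is (largest - smallest) / number of classes = (60 - 40) / 5 = 4. The sketch below also lays out the classes in the exclusive style of question 32, where each upper limit equals the next lower limit; the function name is hypothetical.

```python
def exclusive_classes(smallest, largest, n_classes):
    """Return the class width and the (lower, upper) limits of each class."""
    width = (largest - smallest) / n_classes
    limits = [
        (smallest + i * width, smallest + (i + 1) * width)
        for i in range(n_classes)
    ]
    return width, limits

width, limits = exclusive_classes(40, 60, 5)
print(width)
print(limits)
```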

34 Summary and presentation of data in tabular form with several non overlapping classes is referred to as a) nominal distribution b) frequency distribution c) ordinal distribution d) None of these Ans: b
35 The classification method in which the upper and lower limit of an interval is also in the class interval itself is called…. a) exclusive method b) inclusive method c) mid point method d) None of these Ans: b
36 Suppose there are 25 base classifiers. Each classifier has an error rate of e = 0.35. Suppose you are using averaging as the ensemble technique. What is the probability that the ensemble of the above 25 classifiers will make a wrong prediction? Note: all classifiers are independent of each other a) 0.05 b) 0.06 c) 0.07 d) 0.08 Ans: b
37 The most widely used metrics and tools to assess a classification model are: a) Confusion matrix b) Cost-sensitive accuracy c) Area under the ROC curve d) All of Above Ans: d
38 When performing regression or classification, which of the following is the correct way to preprocess the data? a) Normalize the data → PCA → training b) PCA → normalize PCA output → training c) Normalize the data → PCA → normalize PCA output → training d) None of these Ans: a
39 Which of the following is true about Naive Bayes ? a) Assumes that all the features in a dataset are equally important b) Assumes that all the features in a dataset are independent c) both a and b d) None of these Ans: c
40 In which of the following cases will K-means clustering fail to give good results? 1) Data points with outliers 2) Data points with different densities 3) Data points with nonconvex shapes a) 1 and 2 b) 2 and 3 c) 1, 2, and 3 d) 1 and 3 Ans: c
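The answer to question 36 can be reproduced with a binomial sum. This sketch assumes the ensemble is wrong exactly when a majority of the independent classifiers (13 or more out of 25) are wrong, which is the usual reading of the question; the function name is made up for this illustration.

```python
from math import comb

def majority_vote_error(n_classifiers, error_rate):
    """Probability that more than half of n independent classifiers err."""
    majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate ** k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )

ensemble_error = majority_vote_error(25, 0.35)
print(round(ensemble_error, 2))
```

The result is far below the 0.35 error of a single classifier, which is the point of the question. The improvement depends on the independence assumption stated in the note.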

No. Question a b c d ANS
1 Data visualization is related to… a) Pictorial representations b) numerical representation c) numerical calculations d) None of these Ans: a
2 Which of the following are uses of data visualization? a) See context of data b) Clear data understanding c) finding pattern in data d) all of above Ans: d
3 Which of the following statements are true about using visualizations to display a dataset? I. Visualizations are visually appealing, but don't help the viewer understand relationships that exist in the data. II. Visualizations like graphs, charts, or visualizations with pictures are useful for conveying information, while tables just filled with text are not useful. III. Patterns that exist in the data can be found more easily by using a visualization. a) I AND II b) II AND III c) I AND III d) ONLY III Ans: d
4 The plot method on Series and DataFrame is just a simple wrapper around ____________ a) gplt.plot() b) plt.plot() c) plt.plotgraph() d) none of the mentioned Ans: b
5 Point out the correct combination with regards to kind keyword for graph plotting. a) 'hist' for histogram b) 'box' for boxplot c) 'area' for area plots d) all of the mentioned Ans: d
6 Which of the following value is provided by kind keyword for barplot? a) bar b) bar c) bar d) none of the mentioned Ans: a

7 You can create a scatter plot matrix using the __________ method in pandas.tools.plotting. a) sca_matrix b) scatter_matrix c) DataFrame.plot d) all of the mentioned Ans: b
8 Plots may also be adorned with error bars or tables. a) True b) FALSE c) Cannot Tell d) All Above Ans: a
9 Which of the following plots are often used for checking randomness in time series? a) Autocausation b) Autorank c) Autocorrelation d) none of the mentioned Ans: c
10 __________ plots are used to visually assess the uncertainty of a statistic a) Lag b) RadViz c) Bootstrap d) All Above Ans: c
11 Which of the following is not a challenge in Big Data Visualization? a) Velocity b) Volume c) Version d) Variety Ans: c
12 Which of the following is not a problem in Big Data Visualization? a) Visual Noise b) Scaled Data c) Large image perception d) Information Loss Ans: b
13 Which of the following is a problem in Big Data Visualization? a) Structured Data b) Scaled Data c) Visual Noise d) Multiple valued Data Ans: c
14 Which of the candidate is suitable for interactive visualization? a) Type of Visual b) Cardinality c) Size of data d) all of above Ans: d
15 Which of the following follows interactive visualization approach? a) Overview+Details b) Zoom+Pan c) Focus+Context d) all of above Ans: d
Overview+Deta
16 Visual Mapping is important for_______ Remapping Focus Context a
ils
17 Data visualization techniques are: a) Scatter Plot b) Line Chart c) Pie Chart d) all of above Ans: d
18 Information Visualization techniques are a) Flow Chart b) Time Line c) DFD d) All of above Ans: d
19 Data visualization techniques are: a) Flow Chart b) Time Line c) Pie Chart d) None of these Ans: c
20 Information Visualization techniques are a) Flow Chart b) Line Chart c) Pie Chart d) None of these Ans: a
21 Data visualization techniques are: a) Scatter Plot b) Time Line c) DFD d) None of these Ans: a
22 Information Visualization techniques are a) Scatter Plot b) Time Line c) Bubble Chart d) None of these Ans: b
23 Data visualization techniques are: a) Parallel Coordinates b) Histogram c) Time Line d) None of these Ans: a
24 Information Visualization techniques are a) Semantic Network b) Histogram c) Area Chart d) None of these Ans: a
25 Which of the following is a term related to correlation? a) Exponential b) U-Shape c) Null d) All of above Ans: d
26 Data visualization techniques are: a) Scatter Plot b) Time Line c) DFD d) None of these Ans: a
27 Column graph is another name for _____ a) Bar Chart b) Scatterplot c) Histogram d) Area Chart Ans: a
28 Which of the following follows interactive visualization approach? a) Overview+Details b) Zoom+Pan c) Focus+Context d) all of above Ans: d
29 Information Visualization techniques are a) Pie Chart b) Scatterplot c) Histogram d) Area Chart Ans: a
30 Which of the following is category of timeline? a) Linear Timeline b) Modular Timeline c) Variant Timeline d) ER Timeline Ans: a
31 Which of the following specifies relationship amongst variables? a) Scatter Plot b) Line Chart c) Area Chart d) All of above Ans: d
32 Which of the following specifies category Proportions? a) Pie Chart b) Histogram c) Bar chart d) All of above Ans: d
Which of the following is category of Variant Comparative Modular
33 ER Timeline c
timeline? Timeline Timeline Timeline
34 Information Visualization techniques are a) Flow Chart b) Time Line c) DFD d) All of above Ans: d
35 Data visualization techniques are: a) Flow Chart b) Time Line c) Pie Chart d) None of these Ans: c
36 Data visualization is related to… a) Pictorial representations b) numerical representation c) numerical calculations d) None of these Ans: a
37 Which of the following follows interactive visualization approach? a) Overview+Details b) Zoom+Pan c) Focus+Context d) all of above Ans: d
38 Which of the following are uses of data visualization? a) See context of data b) Clear data understanding c) finding pattern in data d) all of above Ans: d

39 Which of the following specifies relationship amongst variables? a) Pie Chart b) Histogram c) Area Chart d) None of these Ans: c
40 Which of the following specifies category Proportions? a) Pie Chart b) Scatter Plot c) Line Chart d) None of these Ans: a

No. Question a b c d ANS
Eg. Write down question Option a Option b Option c Option d a/b/c/d
1 Precise and steady format data is ____ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: a
2 Inconsistent Data is ______ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: b
3 Format that self defines itself is ________ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: c
4 A little bit inconsistent data is _______ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: d
5 XML is an example of _______ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
6 RDBMS follows __________ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: a
7 Watson is developed by ____ a) IBM b) Microsoft c) AT&T d) Google Ans: a
8 Hadoop is _____ based Framework. a) C++ b) Python c) JAVA d) C# Ans: c
9 Which of the following are components of Hadoop? a) MAPREDUCE b) YARN c) HDFS d) All of Above Ans: d
10 Which of the following are components of HIVE? a) JDBC b) Thrift Server c) CLI d) All of Above Ans: d
11 Mahout provides __________ a) JAVA Executable Libraries b) Mountable Image Format c) C# Executables d) All of Above Ans: a
12 Which of the following are components of HIVE? a) FLATTEN b) Thrift Server c) Muster d) None of these Ans: b
13 Which of the following are components of HIVE? a) FLATTEN b) Thrift Server c) Muster d) All of above Ans: b
14 Which of the following is a component of Hadoop? a) Fork b) YARN c) CLI d) Metadata Ans: b
15 RDBMS follows __________ a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: a
16 Which of the following is a clustering technique? a) Fuzzy K means b) Canopy c) K-Means d) All of above Ans: d
17 Which of the following is HBASE Data Model Terminology? a) Row b) Table c) Column d) All of Above Ans: d
18 Which of the following is not a classification technique? a) Logistic Regression b) Random Forest c) Recommender Algo d) Naïve Bayes Ans: c
19 Which of the following is a classification technique? a) Logistic Regression b) Random Forest c) Naïve Bayes d) All of Above Ans: d
20 Which of the following is HBASE Data Model Terminology? a) Column Family b) Cell c) Timestamp d) All of Above Ans: d
21 Which of the following is a clustering technique? a) Logistic Regression b) Random Forest c) K-Means d) Naïve Bayes Ans: c
22 Which of the following is HBASE Data Model Terminology? a) Identifier b) Variant c) Timestamp d) None of the above Ans: c
23 Which of the following is not a classification technique? a) Logistic Regression b) Random Forest c) K-Means d) Naïve Bayes Ans: c
24 Which of the following are components of HIVE? a) FLATTEN b) Thrift Server c) Muster d) None of these Ans: b

25 Which of the following is HBASE Data Model Terminology? a) Identifier b) Variant c) Column Qualifier d) None of the above Ans: c
26 Mahout provides __________ a) JAVA Executable Libraries b) Mountable Image Format c) C# Executables d) None of the above Ans: a
27 Which of the following is not a clustering technique? a) Logistic Regression b) Canopy c) K-Means d) Fuzzy K means Ans: a
28 Which of the following is a clustering technique? a) Fuzzy K means b) Canopy c) K-Means d) All of above Ans: d
29 Point out the correct statement. a) Hadoop do need specialized hardware to process the data b) Hadoop 2.0 allows live stream processing of real-time data c) In Hadoop programming framework output files are divided into lines or records d) None of the above Ans: b
30
31 What was Hadoop named after? a) Creator Doug Cutting's favorite circus act b) Cutting's high school rock band c) The toy elephant of Cutting's son d) A sound Cutting's laptop made during Hadoop development Ans: c
32 ___________ programming model used to develop Hadoop-based applications that can process massive amounts of data. a) MapReduce b) Mahout c) Oozie d) None of the above Ans: a
33 Which of the following is not a classification technique? a) Logistic Regression b) Random Forest c) K-Means d) Naïve Bayes Ans: c
34 Which of the following are components of HIVE? a) FLATTEN b) Thrift Server c) Muster d) All of above Ans: b
35 Which of the following is a component of Hadoop? a) Fork b) YARN c) CLI d) None of above Ans: b
36 Hadoop is a framework that works with a variety of related tools. Common cohorts include ____________ a) MapReduce, Hive and HBase b) MapReduce, MySQL and Google Apps c) MapReduce, Hummer and Iguana d) All of above Ans: a
37 NoSQL databases are used mainly for handling large volumes of ______________ data. a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data Ans: b
38 Which of the following is not a phase of Data Analytics Life Cycle? a) Communication b) Recall c) Data Preparation d) Model Planning Ans: b
39 Which of the following is a NoSQL Database Type? a) SQL b) Document databases c) JSON d) All of above Ans: b
40 Which of the following is not a NoSQL database a) SQL Server b) MongoDB c) Cassandra d) None of the above Ans: a

marks question A B C D ans
0 1 A group of 4 bits is also called? a) Nibble b) Byte c) Kb d) None Ans: 4 bits make one nibble.
1 1 There are how many types of Big Data: a) 3 b) 2 c) 1 d) None Ans: Big Data is of 3 types.
2 1 Which of the following are the V's of Big Data: a) All b) Volume c) Variety d) Velocity. Ans: This is an explanation.
3 1 Which of these is not a characteristic of Big data? a) Storage b) Volume c) Variety d) Velocity. Ans: This is an explanation.
4 2 Which of the following is a drawback of Big Data: a) Cost b) Significant c) Process d) Fraud Detection Ans: Big Data requires high cost to maintain huge amount of data
5 2 Fullform of GINA is: a) Global Innovation Network and Analysis. b) Global Invention in Networks and Analytics c) Globally Investment in Neurons and Analytics d) None Ans: GINA stands for Global Innovations Networks and Analysis.
6 2 Which is the phase 3 in Data Analytics Life cycle. a) Model Planning b) Model Building c) Data Preparation d) Operationalize Ans: Model Planning is the 3rd phase in life cycle.
7 2 GINA team thought to accomplish mainly ____ goals: a) 3 b) 2 c) 1 d) 5 Ans: GINA targeted to achieve three goals for the project.
8 2 The Data Preparation stage doesn't involve: a) Analyzation b) Collection c) Cleansing d) Processing. Ans: This is an explanation.
9 2 Unstructured Data is further divided into how many types? a) 2 b) 3 c) 4 d) 5 Ans: Unstructured data is divided into 2 types.
10 2 The GINA team mainly used which software tool to analyze the Data a) Tableau b) Hadoop c) HIVE d) SQL Ans: The team used Tableau to visualize the Data.
11 2 Which of the following is the first step of Data Analytics Life Cycle: a) Discovery b) Data Preparation. c) Model Planning d) Data Aware Ans: This is an explanation.
12 2 There are how many phases in data analytics life cycle: a) 6 b) 5 c) 4 d) 7 Ans: There are 6 stages in data analytics life cycle.
13 2 SEMMA Methodology has how many stages: a) 5 b) 4 c) 6 d) 7 Ans: SEMMA methodology has five stages.
14 2 Which phase of Life Cycle requires collaboration with stakeholders? a) Phase 5 b) Phase 6 c) Phase 4 d) Phase 3 Ans: Phase 5 involves collaboration with stakeholders.
15 2 In Building a Model, how many phases are required: a) 2 b) 3 c) 4 d) 5 Ans: This is an explanation.
16 2 How much Data in the whole world is structured: a) 0.2 b) 0.4 c) 0.6 d) 0.5 Ans: Only 20% of world's total data is structured.
17 2 1000^7 bytes (10^21 bytes) of memory is equal to: a) 1ZB b) 1TB c) 1YB d) 1XB Ans: 1000^7 B (10^21 B) is equal to 1 ZB.
18 2 Data Scientists in the GINA team used which technique on the textual Description of the Innovation Roadmap Idea. a) Natural Language Processing(NLP) b) Hadoop c) HIVE d) SQL Ans: NLP technique was used on the description of Innovation Roadmap Idea.
19 2 How many types of data analytics methodologies are there? a) 2 b) 4 c) 3 d) 6 Ans: There are two types of data analytical methodologies: EDA and CDA.
20 3 Other name for Bell Curve is: a) Normal Distribution. b) Poisson Distribution c) Binomial Distribution d) Bernoulli Distribution. Ans: Bell Curve is also known as normal distribution.
21 3 One of the most important tasks in big data analytics is: a) Statistical Modeling b) Testing of Data c) Visualization d) Operationalize Ans: One of the most important tasks in big data analytics is statistical modeling
22 3 Some of the approaches considered for building the data analytics lifecycle framework best practices are: a) All b) CRISP-DM c) SEMMA d) MAD Skills Ans: This is an explanation.
23 3 In Phase 4, the team develops datasets for: a) All b) Testing of Data c) Training of Data d) Production purposes Ans: This is an explanation.
24 3 Fullform of CRISP-DM Methodology is: a) Cross Industry Standard Process for Data Mining b) Common Industry Standard Program for Data Mining c) Cross International Standard Process for Data Modeling d) Company's Initial Standards Progress for Data Methods Ans: CRISP-DM stands for Cross Industry Standard Process for Data Mining.
25 3 SEMMA Methodology doesn't include which of the following stages: a) Evaluate b) Sample c) Explore d) Assess Ans: This is an explanation.
26 3 In which stage is the data monitored and analyzed to see if the generated model is creating the expected results? a) Operationalize b) Collection c) Plan Model d) Data Aware Ans: In the last phase, i.e. Operationalize, data is monitored and analyzed to see if the generated model is creating the expected results.
27 3 Data is captured in how many ways: a) 3 b) 4 c) 5 d) 6 Ans: Data is captured in 3 main ways.

28 3 In phase 2 of the Data Analytics Life Cycle, the team performs how many analytics to get the data in the sandbox. a) 3 b) 2 c) 4 d) 6 Ans: The team performs ETL and ELT and ETLT in 2nd phase of the cycle.
29 3 The total area under the bell curve is ____ unit. a) 1 b) 2 c) 3 d) 4 Ans: Area under the bell curve is 1 unit.
30 1 Wilcoxon rank-sum test is also known as? a) Mann-Whitney U test b) Mean Difference c) Alternative Hypothesis d) Null Hypothesis Ans: Wilcoxon rank-sum test is also called Mann-Whitney U Test.
31 1 Which test is also known as T-test? a) Hypothesis Test b) Mean Difference c) K-means test d) None Ans: This is an explanation.
32 1 This equation is of which test? a) Mean Difference b) K-Means c) Null Hypothesis d) Alternative Hypothesis Ans: This eqn is of Mean difference test.
33 1 A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called ___________. a) One tailed test b) Two-tailed test c) Tailed test d) Null test Ans: A test of a statistical hypothesis, where the region of rejection is on only one side of the sampling distribution, is called a one-tailed test
34 1 How many types of Statistical Hypothesis are there? a) 2 b) 3 c) 4 d) 6 Ans: There are two types of Statistical Hypothesis.
35 1 Analysis of Variance is also referred to as? a) ANOVA b) Mean Difference c) Alternative Hypothesis d) Null Hypothesis Ans: ANOVA stands for Analysis of Variance.
36 1 How many steps are involved in a Hypothesis Testing? a) 4 b) 2 c) 3 d) 5 Ans: There are 4 steps in Hypothesis testing.
37 2 The strength of evidence in support of a null hypothesis is measured by? a) P-value b) K-value c) H-value d) Null-value Ans: The strength of evidence in support of a null hypothesis is measured by the P-value.
38 2 Difference in means is also called? a) Two sample t-test b) T-test c) M-test d) Two sample test Ans: Difference in means is also known as two sample t test.
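The two sample t-test named in question 38 compares the means of two groups. The sketch below computes the pooled-variance (Student) form of the statistic, which assumes both groups share a common variance; the sample values and the function name are invented for illustration, and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(sample_a, sample_b):
    """Pooled-variance t statistic for the difference in two sample means."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error

# Made-up measurements for two groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0]
group_b = [4.1, 4.4, 4.0, 4.6, 4.2]
t_stat = two_sample_t(group_a, group_b)
print(t_stat)
```

A large absolute value of the statistic is evidence against the null hypothesis that the two means are equal.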
39 2 The k-medoids is also called _______________ algorithm. a) Partitioning Around Medoids (PAM) b) Lloyd's Algorithm c) Poisson's Algorithm d) Regression Ans: The k-medoids is also called partitioning around medoids (PAM) algorithm.
40 2 Clustering is an example of ____? a) Unsupervised Learning b) Supervised Learning c) Classification d) Regression Ans: Clustering is an example of unsupervised learning.
41 2 Which of the following is not an advantage of K means Clustering? a) Requires a Priori b) Fast c) Robust d) easy to evaluate. Ans: This is an explanation.
42 2 The probability of committing a Type 2 error is called a) Beta b) Alpha c) Delta d) Theta Ans: The probability of committing a Type II error is called Beta
43 2 The ______ variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster. a) Less b) More c) Variable d) Fixed Ans: The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
44 2 Which hypothesis is usually the hypothesis in which sample observations result purely from chance? a) Null-Hypothesis b) Mean Difference c) K-means test d) Alternative Hypothesis Ans: Null Hypothesis is usually the hypothesis that sample observations result purely from chance.
45 2 "Classical" ANOVA for balanced data does how many things at once? a) 3 b) 2 c) 1 d) 4 Ans: "Classical" ANOVA for balanced data does three things at once.
46 2 K-mean clustering is used to solve which problems? a) NP-hard problems b) NP Problems c) Hypothesis Problems d) P problems Ans: NP hard problems are solved using K means clustering.
47 2 The probability of committing a Type I error is called? a) Alpha b) Beta c) Gamma d) Delta Ans: The probability of committing a Type I error is called alpha
48 2 K means Clustering is also known as? a) Lloyd's Algorithm b) Gaussian Algorithm c) Poisson's Algorithm d) None Ans: K means clustering is also called Lloyd's algorithm.
49 3 Which algorithm requires the user to specify the number of clusters k to be generated. a) K-means clustering b) Gaussian Algorithm c) Alternative Hypothesis d) Null Hypothesis Ans: k-means clustering requires the user to specify the number of clusters k to be generated.
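Questions 48 and 49 describe k-means as Lloyd's algorithm with a user-chosen k. A minimal one-dimensional sketch, assuming the caller supplies the k starting centroids (real implementations usually pick them at random) and using made-up points, alternates the assignment and update steps until the centroids stop moving.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Lloyd's algorithm on 1-D data; k is the number of starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: (p - centroids[i]) ** 2
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d(
    [1.0, 1.5, 2.0, 9.0, 10.0, 11.0], centroids=[0.0, 5.0]
)
print(centroids)
print(clusters)
```

Each pass can only lower the within-cluster squared error, which is the objective function the question bank refers to elsewhere.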
50 3 K means clustering uses which approach to solve the problems? a) Expectation-maximization b) Greedy Approach c) Divide and Conquer d) None Ans: Expectation-maximization technique is used by k means clustering.
51 3 How many factors affect the power of a hypothesis test? a) 3 b) 2 c) 1 d) 4 Ans: The power of a hypothesis test is affected by three factors.
52 3 Law of Variance is called? a) Eve's Law b) Laplace Law c) Poisson's Algorithm d) Regression Ans: Law of variance is also called Eve's law.
53 3 K-Medoids use which approach to solve problems? a) Greedy Approach b) Divide and Conquer c) Recursive d) None Ans: K Medoids use greedy approach to solve problems
54 3 The time complexity of k means clustering is? a) O(n^2) b) O(nlogn) c) O(n) d) O(1) Ans: Time complexity is O(n^2) of k means clustering.
55 3 The number (k) of clusters assumed in k-medoids is known as? a) Priori b) Null Hypothesis c) Effect size d) ANOVA Ans: The number k of clusters is assumed known a priori.
56 3 What is the difference between the true value and the value specified in the null hypothesis. a) Effect-size b) Null Hypothesis c) Alternative Hypothesis d) ANOVA Ans: The effect size is the difference between the true value and the value specified in the null hypothesis.
57 3 Time complexity of k medoids is? a) O(n^2) b) O(nlogn) c) O(n) d) O(n^3) Ans: This is an explanation.
58 3 Which algorithm aims at minimizing an objective function known as squared error function a) K-means b) Mean Difference c) Alternative Hypothesis d) ANOVA Ans: K means algorithm aims at minimizing an objective function known as squared error function
59 1 Which algorithm was the earliest of the association rule algorithms? a) Apriori Algorithm b) Gaussian Algorithm c) K means clustering d) Bernoulli Distribution. Ans: Apriori Algorithm was the earliest of the association rule algorithms.
60 1 The Apriori algorithm takes a ______ iterative approach to uncovering the frequent itemsets by first determining all the possible items a) Bottom-Up b) Top-Down c) Recursive d) None Ans: The Apriori algorithm takes a bottom-up iterative approach to uncovering the frequent itemsets by first determining all the possible items
61 1 Apriori uses which structure to count candidate item sets efficiently? a) BFS b) DFS c) Queue d) Stack Ans: Apriori uses breadth-first search and a Hash tree structure to count candidate item sets efficiently
62 1 "y=a+b*x^2". This equation shows which regression? a) Polynomial Regression b) Logistic Regression c) Linear Regression d) Lasso Regression Ans: This is an explanation.
63 2 __________ is defined as the measure of certainty or trustworthiness associated with each discovered rule. a) Confidence b) Recursion c) Item-set d) None Ans: Confidence is defined as the measure of certainty or trustworthiness associated with each discovered rule.
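Questions 59 to 63 describe Apriori's bottom-up search and the confidence of a rule. The sketch below is a compact illustration under simplifying assumptions: it counts support by scanning every transaction rather than using the hash tree mentioned in question 61, `min_count` is an absolute support count, and the baskets are invented.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets meeting min_count."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = [frozenset([i]) for i in items]   # start with single items
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = [
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

def confidence(frequent, antecedent, consequent):
    """confidence(X -> Y) = count(X and Y together) / count(X)."""
    antecedent, consequent = frozenset(antecedent), frozenset(consequent)
    return frequent[antecedent | consequent] / frequent[antecedent]

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
]
frequent = apriori(baskets, min_count=2)
conf = confidence(frequent, {"bread"}, {"milk"})
print(len(frequent), conf)
```

`confidence` only works for itemsets that survived the support threshold; a rule built from an infrequent itemset raises `KeyError` in this sketch.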
64 2 In which Regression, we predict the value by 1 or 0? a) Logistic Regression b) Linear Regression c) Both d) None Ans: In Logistic Regression, we predict the value by 1 or 0.
65 2 The formula for linear regression is: a) Y' = bX+A b) Y' = bX - A. c) Y' = bX /A. d) Y' = bX * A. Ans: The formula for linear regression is: Y' = bX + A.
66 2 Which regression is useful when there are a large number of independent variables. a) Partial Least Squares(PLS) Regression b) Cox Regression c) Lasso Regression d) Logistic Regression Ans: PLS regression is also useful when there are a large number of independent variables.
67 2 Which regression is an approach for predicting a response using a single feature. a) Linear-Regression b) Logistic Regression c) Elasticnet Regression d) None Ans: Simple linear regression is an approach for predicting a response using a single feature.
68 2 Association rule mining consists of _______ steps. a) 2 b) 3 c) 4 d) 5 Ans: Association rule mining consists of 2 steps
69 2 Which type of regression is suitable when dependent variable is ordinal in nature? a) Ordinal Regression b) Linear Regression c) Cox Regression d) Logistic Regression Ans: Ordinal regression is suitable when dependent variable is ordinal in nature
70 2 Which regression is used for support vector machines a) ElasticNet Regression b) Linear Regression c) Logistic Regression d) None Ans: ElasticNet regression is used for support vector machines.
71 2 Which regression can solve both linear and non-linear models? a) Support Vector Regression b) Linear Regression c) Logistic Regression d) ElasticNet Regression Ans: Support Vector Regression can solve both linear and non-linear models.
72 2 Which is the most common method used for fitting a regression line a) Least Square Method b) Mean Difference c) Null Hypothesis d) Classification Ans: Least Square Method is the most common method used for fitting a regression line
73 2 _______ problems are when the output variable is a real or continuous value. a) Regression b) Classification c) Recursive d) Hypothesis Ans: A regression problem is when the output variable is a real or continuous value.
74 2 Linear Regression is a machine learning algorithm based on ______ learning regression model. a) Supervised Learning b) Unsupervised Learning c) Recursive Learning d) All Ans: Linear Regression is a machine learning algorithm based on supervised regression algorithm.
75 2 When dependent variable's variability is not equal across values of an independent variable, it is called a) Heteroscedasticity b) Homoscedasticity c) Multicollinearity d) Outliers. Ans: When dependent variable's variability is not equal across values of an independent variable, it is called heteroscedasticity
76 2 _________ requires large sample sizes because maximum likelihood estimates are less powerful at low sample sizes than ordinary least square a) Logistic Regression b) Linear Regression c) Lasso Regression d) ElasticNet Regression Ans: Logistic Regression requires large sample sizes because maximum likelihood estimates are less powerful at low sample sizes than ordinary least square
77 2 PCR Regression is divided into how many steps? a) 2 b) 3 c) 4 d) 5 Ans: PCR regression is divided into 2 steps
78 3 L2 regularization is also called? a) Tikhonov Regularization b) Norm Regularization c) Poisson's Regularization d) None Ans: This is an explanation.
79 3 When the variance of count data is greater than the mean count, it is a case of? a) Overdispersion b) Underdispersion c) Dispersion d) High dispersion Ans: When the variance of count data is greater than the mean count, it is a case of overdispersion
Which regression assumes the Linear regression assumes the
80 3 normal distribution of the Linear Regression Logistic Regression ElasticNet Regression None normal or Gaussian distribution of
dependent variable? the dependent variable.
Nature of predicted data in Nature of predicted data in
81 3 Ordered Unordered Both None
regression is? regression is ordered.
Which regression uses a binary Logistic regression uses a binary
82 3 dependent variable but ignores Logistic Regression Linear Regression Cox Regression Lasso Regression dependent variable but ignores
the timing of events. the timing of events.
The Ridge Regression is also The ridge regression is also
83 3 Shrinkage Regression Percentile Regression Elasticnet Regression Lasso Regression
known as? known as Shrinkage Regression.
In which regression do we In Linear Regression we calculate
calculate Root Mean Square Root Mean Square
84 3 Linear-Regression ElasticNet Regression Logistic Regression All
Error(RMSE) to predict the Error(RMSE) to predict the next
next weight value. weight value.
The______ is the standard The residual standard error is the
85 3 deviation of the observed Residual standard error Mean Difference Error Data Error All standard deviation of
residuals. the observed residuals.
Which Regression is used Poisson regression is used when
86 3 when the dependent variable has Poisson Regression Linear Regression Cox Regression Lasso Regression the dependent variable has count
count data. data.
________________ regression can Quasi-Poisson regression can
87 3 handle both over-dispersion and Quasi-Poisson regression Cox Regression ElasticNet Regression Linear Regression handle both over-dispersion and
under-dispersion. under-dispersion.
___ is the regularization λ is the regularization parameter
88 3 parameter in Lasso Regression? λ θ Ω β in lasso regression.
Decision Tree is a hierarchical Decision Tree is a hierarchical
model that does the separation model that recursively does the
89 1 of the input space into class Recursion Pointers Greedy Approach Divide and Conquer separation of the input space
regions using: into class regions.
Learning Algorithm of Decision Decision Tree uses greedy
90 1 Greedy Approach Divide and Conquer Both None
Tree is: approach for learning algorithm.
Normal Distribution is also
91 1 Gaussian Distribution Bernoulli Distribution Naïve Bayes Binary Distribution This is an explanation.
called?
Classification has how many There are 2 phases of
92 1 2 3 4 5
phases: classification.
"Every pair of features being Naïve Bayes uses the principle that
classified is independent of every pair of features being
93 1 Naïve Bayes Classifier Decision Tree Bernoulli Distribution Normal Distribution
each other". This principle is classified is independent of each
used by: other.
This equation is of which
94 2 Gaussian Distribution Binary Distribution Naïve Bayes Cross-Entropy This is an explanation.
theorem?
In Naïve Bayes, the datasets Datasets are divided into two
95 2 are divided into how many 2 3 4 5 types in Naïve Bayes.
types?
Decision trees used to Decision trees used to
96 2 predict non-categorical values Regression Trees Categorical trees Normal tree None predict non-categorical values are
are called? called regression trees.
An attribute with a ____ Gini An attribute with a lower Gini index
97 2 index should be preferred in a Lower Higher Recursive Negative should be preferred.
decision tree.
In Naïve Bayes, if any two If any two events A and B are
98 2 events A and B are P(A,B)=P(A)P(B) P(A,B)=P(A)/P(B) P(A,B)=P(B) P(A,B)=P(B)/P(A) independent,
independent, then: then P(A,B)=P(A)P(B).
What is the measure of Entropy is the measure of
99 2 uncertainty of a random Entropy Gain Gini Index None uncertainty of a random variable.
variable in a decision tree?
Which of the following is not
100 2 Stable Easy to understand Easy to explain Easy to evaluate This is an explanation.
true for decision trees?
Decision tree algorithm falls Decision tree algorithm falls
101 2 under the category of which Supervised Unsupervised Regression Classification under the category of supervised
learning? learning
False Positives and False One use of Bayes' Theorem is
102 2 Negatives is an application of Bayes' Theorem Binary Distribution Bernoulli Distribution Normal Distribution reasoning about false positives
which theorem? and false negatives.
Decision Trees used in data There are 2 types of decision
103 2 mining are of how many 2 3 4 5 trees used in data mining.
types?
In Bayes' Theorem, P(A) and P(A) and P(B) are the
P(B) are the probabilities of probabilities of observing A and
104 3 observing A and B Marginal Probability Normal Distribution Bernoulli Distribution Parallel Algorithm B respectively; they are known
respectively; they are known as the marginal probability.
as:
ID3 Algorithm in a decision ID3 stands for Iterative
105 3 Iterative Dichotomiser 3 (ID3) Interval Driven Interconnected Decision None
tree stands for? Dichotomiser 3 (ID3)
Probably the best way of Probably the best way of
106 3 estimating performance for Bootstrapped Method Normal Distribution Naïve Bayes Binary Distribution estimating performance for very
very small data sets is: small data sets is the bootstrapped
method.
The Decision Tree works on Decision Tree works on
107 3 Disjunctive Normal Form Product of Sum Bijective Form Conjunctive Form
which form? Disjunctive normal form.
The decoupling of the class The decoupling of the class
conditional feature distributions conditional feature distributions
108 3 means that each distribution 1-D 2-D 3-D NONE means that each distribution can
can be independently estimated be independently estimated as a
as a________ distribution. one dimensional distribution.
Theoretical concept to evaluate
109 3 COLT PAC Model Naïve Bayes Prediction This is an explanation.
classifiers is:
____________is a metric to Gini Index is a metric to measure
measure how often a randomly how often a randomly chosen
110 3 Gini Index Entropy Pointer Cross-Entropy
chosen element would be element would be incorrectly
incorrectly identified identified
The most notable types of The most notable types of
111 3 3 2 1 4
decision tree algorithms are: decision tree algorithms are 3
Which process is completed The recursive partition is
when the subset at a node all completed when the subset at a
112 3 Recursive Partitioning Termination Transformation Prediction.
has the same value of the target node all has the same value of
variable? the target variable
The_______ method reserves The holdout method reserves a
113 3 a certain amount for testing and Holdout Parallel Algorithm Naïve Bayes Normal Distribution certain amount for testing and
uses the remainder for training. uses the remainder for training
This equation is of which
114 3 Bayes' Theorem Normal Distribution Bernoulli Distribution Cross-Entropy This is an explanation.
theorem?
"Independence among the Independence among the
115 3 features". This is an assumption Naïve Bayes Classifier Bernoulli Distribution Parallel Algorithm Binary Distribution features is an assumption in
in: Naïve Bayes.
Error rate obtained from error rate obtained from training
116 3 Resubstitution Error Grid Gini Index True error
training data is called: data is called resubstitution error.
In Decision Tree entropy is
117 3 proportional inverse High Less This is an explanation.
__________ to content.
In a Decision Tree, no root-to-leaf No root-to-leaf path should
118 3 path should contain the Twice Once Thrice Four Times contain the same discrete
same discrete attribute attribute twice.
____________.
Using _________, designers Using data visualization methods,
119 1 can make information Data Visualization Classification Regression Supervised Learning designers can make information
understandable for understandable for stakeholders.
stakeholders.
The additional visual methods
120 1 All Tree Map Parallel Coordinates Semantic Networks This is an explanation.
include:
Data visualization tools
121 1 MS Excel Tableau Power BI Jupyter This is an explanation.
don't include:
Which of the following requires
122 1 JavaScript knowledge to run All Chart.js Polymaps Sigma.js This is an explanation.
the visualization tool?
Merits of Tableau don't Merits of Tableau don't include
123 1 Cost Performance Usage Computation
include which factor? the cost factor.
Which of these is not a type of
124 1 Pictograph Bar-Graph Line-Chart Pie-Chart This is an explanation.
Big Data Visualization.
The drag-and-drop editor of The drag-and-drop editor of
which tool makes it easy to Infogram makes it easy to create
125 2 create professional-looking Infogram Google Chart Tableau Grafana professional-looking designs
designs without a lot of visual without a lot of visual design skill.
design skill?
How many V's are defined for There are 4 V's of Data
126 2 4 6 2 3
Data Visualization. visualization.
Which of the following is not a Tableau is a chargeable tool of
127 2 Tableau Google Chart Jupyter Hub-Spot CRM
free Data Visualization tool? data visualization.
Companies that work with Companies that work with both
both traditional and big data traditional and big data may use
128 2 use which technique to look at Pie-Chart Bar-Graph Stream graph Line-Chart a pie chart to look at customer
customer segments or market segments or market shares.
shares?
Visualization of data includes
129 2 which of the following All Information Loss Visual Noise Large Image Perception This is an explanation.
problems:
Mainly, Data Visualization has There are 5 main challenges to
130 2 5 6 4 2
how many types of challenges? data visualization.
Which tool uses HTML5/SVG Google Charts uses HTML5/SVG
131 2 to visualize data? Google Charts Jupyter Grafana Tableau because it is browser
compatible.
According to Colin Ware’s According to Colin Ware’s
Information Visualization: Information Visualization:
132 2 Perception for Design, he 4 2 1 3 Perception for Design, he defines
defines_____ pre-attentive four pre-attentive visual
visual properties. properties
_____ is based on space-filling Tree map method is based on
133 2 visualization of hierarchical Tree-Map Stream graph Bar-graph Line-Chart space-filling visualization of
data. hierarchical data
Which graph shows the Gantt chart show the
dependency relationships dependency relationships
134 2 Gantt-Chart Line-Chart Pie-Chart Bar-Graph
between activities and current between activities and current
schedule status. schedule status.
Another name for distribution Non parametric data is also
135 2 Non parametric data Parametric Data static data Dynamic data
free data is: called distribution free data.
Which chart is used for Bar Graph is used for
comparison of values, such as Comparison of values, such as
136 2 sales performance for several Bar-Graph Gantt-Graph Line-Chart Pie-Chart sales performance for several
persons or businesses in a persons or businesses in a single
single time. time
_____________ are graphics Graphical techniques are
137 2 in the field of statistics used to Graphical-Techniques Line-Chart Regression Classification graphics in the field of statistics
visualize quantitative data. used to visualize quantitative data.
_____ can handle several Parallel Coordinates can handle
factors for a large number of several factors for a large
138 2 objects per single screen, so it Parallel Coordinates Stream graph Google Chart Jupyter number of objects per single
satisfies the data variety screen, so it satisfies the data
criterion. variety criterion
Chart.js provides how many
139 3 8 5 3 6 This is an explanation.
types of charts?
Which visualization tool Grafana supports mixed data
supports mixed data sources, sources, annotations, and
140 3 annotations, and customizable Grafana Tableau Google Chart Jupyter customizable alert functions, and
alert functions, and can be it can be extended via hundreds
extended via hundreds of of available plugins.
available plugins?
Which tool was created Datawrapper was created
141 3 specifically for adding charts Datawrapper Tableau Google Chart Jupyter specifically for adding charts and
and maps to news stories. maps to news stories.
Conventional visualization Mekko chart is a new technique
142 3 Mekko Chart Pie-Chart Bar-graph Histogram
methods don't include: to visualize data.
_____________ is a type of a Streamgraph is a type of a
stacked area graph, which is stacked area graph, which is
143 3 displaced around a central axis, Streamgraph Bar-Graph Pie-Chart Line-Chart displaced around a central axis,
resulting in flowing and organic resulting in flowing and organic
shape. shape
Which visual tool includes over FusionCharts includes over 150
144 3 150 chart types and 1,000 FusionCharts Tableau Google Chart Jupyter chart types and 1,000 map types.
map types?
Which graph/chart is a A semantic network is a
graphical representation of graphical representation of
logical relationships between logical relationships between
145 3 different concepts? It generates Semantic Networks Bar-Graph Pie-Chart Line-Chart different concepts. It generates
a directed graph, the a directed graph, the combination
combination of nodes or of nodes or vertices, edges or
vertices, edges or arcs, and a arcs, and a label over each edge.
label over each edge.
According to SAS we can According to SAS we can
process only______ of process only 1 kilobit of
146 3 1 Kilobit 1 Byte 1 Bit 1 MB
information per second on a information per second on a flat
flat screen. screen
There are____ steps for
147 3 4 5 3 6 This is an explanation.
interactive data visualization:
When working with big data, When working with big data,
companies can use which companies can use the line chart
visualization technique to track visualization technique to track
148 3 total application clicks by Line-Chart Bar-Graph Pie-Chart Stream graph total application clicks by weeks,
weeks, the average number of the average number of
complaints to the call center by complaints to the call center by
months, etc. months, etc.
Which of the following
149 1 All Facebook Netflix Adobe This is an explanation.
enterprises use HBase?
Which NLP is used in the From 2010, Neural NLP is
150 1 Neural NLP Symbolic NLP Statistical NLP None
present era? being used.
The Computer World magazine The Computer World magazine
states that unstructured states that unstructured
151 1 information might account for 70-80% 90% 50% 60% information might account for
more than______of all data in more than 70%–80% of all data
organizations. in organizations.
Almost all of the information Almost all of the information we
we use and share every day, use and share every day, such as
152 1 such as articles, documents and Unstructured Structured Semantic None articles, documents and e-mails,
e-mails, are are completely or partly
completely___________. unstructured
Which standard provided a The Unstructured Information
common framework for Management Architecture
processing information to (UIMA) standard provided a
153 1 extract meaning and create Unstructured Information Management Architecture (UIMA) Management Architecture for Data Data Architecture None common framework for
structured data about the processing this information to
information? extract meaning and create
structured data about the
information.
The base Apache Hadoop The base Apache Hadoop
154 2 framework is composed of 4 2 3 6 framework is composed of
how many modules? four modules.
NoSQL doesn't include
155 2 MS-SQL HBase DynamoDB MongoDB This is an explanation.
which software?
There are _______main types There are 3 types of OLAP
156 2 3 2 5 6
of OLAP systems. systems.
SQL alternative in Apache HIVE-QL is the alternative to
157 2 HIVEQL BASEQL SPARK-QL H-QL
HIVE is called? SQL in the Apache Hive family.
MapReduce program executes MapReduce program executes in
158 2 3 2 5 4
in how many stages? three stages.
How many types of NO-SQL There are 4 types of databases in
159 2 4 3 2 6
database are there? NO-SQL.
MapReduce is a processing MapReduce is a processing
technique and a program technique and a program model
160 2 model for distributed Java Python C++ R for distributed computing based
computing based on which on Java.
programming language?
Hive supports how many Hive supports all four properties
161 2 4 3 2 1
properties of transactions? of transactions
HDFS consists of only one HDFS consists of only one
162 2 Name Node, which is called the? Master Node Slave Node Both None Name Node, which is called the
Master Node.
Which Apache software is
needed to process massive Apache HBase is used to process massive
163 2 amounts of data for the Apache HBASE Apache Spark Apache-PIG Apache-mahout amounts of data for the purposes
purposes of natural-language of natural-language search.
search?
Which database stores data in a NoSQL databases store data
164 2 format other than relational NO-SQL HIVESQL SPARK-QL H-QL in a format other than relational
tables? tables.
Which is a project of the Mahout is a project of the
Apache Software Foundation Apache Software Foundation to
to produce free produce free implementations of
165 2 implementations of distributed Apache Mahout Apache Spark Apache-PIG Apache HBASE distributed or otherwise scalable
or otherwise scalable machine machine learning algorithms
learning algorithms focused focused primarily on linear
primarily on linear algebra? algebra.
Which model is a specialization The MapReduce model is a
166 2 of the split-apply-combine MapReduce Hadoop HBASE HIVE specialization of the split-apply-combine
strategy for data analysis? strategy for data analysis.
All Hadoop commands are All Hadoop commands are invoked by the
167 2 invoked by which command? $HADOOP_HOME/bin/hadoop $HADOOP/bin/hadoop $HADOOP_HOME/hadoop $HADOOP_HOME/bin $HADOOP_HOME/bin/hadoop
command.
The table typically enforces the The table typically enforces the
schema when the data is schema when the data is loaded
loaded into the table. This into the table. This enables the
enables the database to make database to make sure that the
168 3 sure that the data entered Schema on Write Schema on Read Schema for Read Write None data entered follows the
follows the representation of representation of the table as
the table as specified by the specified by the table definition.
table definition. This design is This design is called schema on
called? write.
Which command formats the Namenode -format command
169 3 Namenode -format Node -format Name -format Format
DFS filesystem? formats the DFS file system.
Which command applies the oiv applies the offline fsimage
170 3 offline fsimage viewer to an oiv fs fc ov viewer to an fsimage.
fsimage?
Hadoop requires which Java Hadoop requires Java Runtime
171 3 Runtime Environment (JRE) 1.6 1.2 1.5 1 Environment (JRE) 1.6 or higher.
version or higher?
Every Data node sends a Every Data node sends a
Heartbeat message to the Heartbeat message to the Name
172 3 Name node every ____ 3 2 4 1 node every 3 seconds and
seconds and conveys that it is conveys that it is alive.
alive.
HDFS can store up to 1 TB of
173 3 HDFS can store files up to: 1 TB 1 GB 1 ZB 1 PB
files.
Which of the following is a HBase is a popular wide-column
174 3 HBase SQL DynamoDB MongoDB
wide-column store? store.
Which node acts as both a A slave or worker node acts as
175 3 DataNode and TaskTracker in Slave Node Data Node Admin Node Name Node both a DataNode and
Hadoop? TaskTracker.
HDFS system uses which HDFS system uses TCP/IP
176 3 TCP/IP TCP UDP IP
protocol for communication? sockets for communication
177 3 HDFS has how many services? 5 4 2 6 HDFS has five services.
____________ is a data Hive is a data warehouse
warehouse software project software project built on top of
178 3 built on top of Apache Hadoop Apache HIVE Apache Spark Apache-PIG Apache HBASE Apache Hadoop for providing
for providing data query and data query and analysis.
analysis.

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. -------- function is used to add a title to each Axes instance in a figure.

A : set_title()

B : get_title()

C : set_label()

D : title()

Q.no 2. ---------- provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python.

A : Pandas

B : Numpy

C : Scikit-Learn

D : image
Q.no 3. The ---------- attribute specifies the number of dimensions or axes of the
array.

A : ndarray.size

B : ndarray.dtype

C : ndarray.ndim

D : ndarray.axes
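
For reference, a minimal sketch of how the real ndarray attributes differ on a small array. The array `a` is an illustrative assumption, not part of the question paper.

```python
import numpy as np

# Hypothetical 2 x 3 array used only for illustration.
a = np.array([[1, 2, 3], [4, 5, 6]])

n_axes = a.ndim    # number of dimensions (axes)
n_items = a.size   # total number of elements
shape = a.shape    # length along each axis
kind = a.dtype     # element type, platform dependent for integers

print(n_axes, n_items, shape, kind)
```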

Q.no 4. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent item sets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 5. ---------------- submodule of scipy is dedicated to image processing.

A : ndarray

B : spatial

C : ndimage

D : special

Q.no 6. If the number of input features is 3, then the optimal hyperplane in a support
vector machine is -------------

A : Single point

B : Line

C : 2-D Plane

D : Non linear line

Q.no 7. --------------- is an example of human-generated unstructured data.

A : Text files

B : Satellite data

C : Sensor data
D : Seismic imagery data

Q.no 8. -------------- must be installed before you use scikit learn

A : Matlab

B : Scilab

C : Scipy

D : Numpy

Q.no 9. The procedure to organize items of a given collection into groups based on
some similar features is called -------------

A : Regression

B : Clustering

C : Decision Trees

D : Association

Q.no 10. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 11. Which function is used to give a title to the axes?

A : plt.title()

B : plt.xlabel()

C : plt.ylabel()

D : plt.xscale()

Q.no 12. ------------- function is used to plot a histogram using matplotlib library

A : hist()

B : bar()

C : pie()
D : scatter()

Q.no 13. Which of the following is a measure used in decision trees while selecting the
splitting criterion that partitions data in the best possible manner?

A : Probability

B : Gini Index

C : Regression

D : Association
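
As an illustration of the Gini index used to compare candidate splits, here is a minimal sketch in plain Python. The helper name `gini_index` and the sample labels are assumptions made for this example; lower values indicate a purer node.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical node contents, for illustration only.
pure = gini_index(["yes", "yes", "yes", "yes"])   # a single class
mixed = gini_index(["yes", "yes", "no", "no"])    # an even two-class split

print(pure, mixed)
```

A split whose child nodes have a lower weighted Gini index is generally preferred.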

Q.no 14. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 15. Which of the following is not a type of clustering algorithm?

A : Density clustering

B : K-Mean clustering

C : Centroid clustering

D : Simple clustering

Q.no 16. ------ answers the questions like " How can we make it happen?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 17. -------------- data does not fit into a data model due to variations in content.

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 18. ---------------- function multiplies two matrices in numpy.

A : prod()

B : mult()

C : dot()

D : *
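
The difference between the matrix product and the `*` operator can be seen in a short sketch. The matrices `A` and `B` are made-up values for illustration.

```python
import numpy as np

# Hypothetical 2 x 2 matrices.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matrix_product = np.dot(A, B)   # rows of A combined with columns of B
elementwise = A * B             # `*` multiplies matching entries instead

print(matrix_product)
print(elementwise)
```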

Q.no 19. -------------------- is a general-purpose array-processing package that provides a
high-performance multi-dimensional array object and tools for working with
these arrays.

A : NumPy

B : SciPy

C : sklearn

D : None of these

Q.no 20. -------- library is built on the top of Numpy, SciPy and Matplotlib

A : Sympy

B : Scikit

C : Pandas

D : Numpy

Q.no 21. The last element of ndarray is indexed by -------------

A : 0

B : -1

C : 1

D : -2

Q.no 22. ------------ is the step performed by a data scientist after acquiring the data.

A : Data Cleansing

B : Data Integration
C : Data Replication

D : Data loading

Q.no 23. ------------- function is used to save an array as an image file.

A : matplotlib.pyplot.image()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 24. ------------- is unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 25. What is the correct syntax to generate integers from 10 to 30?

A : x=numpy.arange(10,30)

B : x=numpy.array(10,30)

C : x=numpy.arange(10,31)

D : x=arange(10,31)
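
The half-open behaviour of `numpy.arange` is easy to check with a short sketch. The variable names are illustrative assumptions.

```python
import numpy as np

half_open = np.arange(10, 30)   # the stop value 30 is excluded
inclusive = np.arange(10, 31)   # stop one past the last wanted value

# Index -1 refers to the last element of an ndarray.
print(half_open[-1], inclusive[-1])
```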

Q.no 26. ---------- function is used to get the element-wise remainder of array division.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)
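
A minimal sketch contrasting the element-wise remainder with true division; the input arrays `x1` and `x2` are assumed values chosen for illustration.

```python
import numpy as np

x1 = np.array([10, 11, 12])
x2 = np.array([3, 4, 5])

remainders = np.mod(x1, x2)          # element-wise remainder
quotients = np.true_divide(x1, x2)   # element-wise floating-point division

print(remainders)
print(quotients)
```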

Q.no 27. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support
C : Lift

D : None of These

Q.no 28. A ------------ is a supervised machine learning algorithm which relies on the
assumption of feature independence to classify input data.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 29. What is the use of the following function? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to figure

D : Adds text to figure

Q.no 30. Pandas provides the ----------- function as the entry point for all standard
database join operations while merging two DataFrame objects.

A : concat()

B : replace()

C : merge()

D : add()

Q.no 31. Data generated on twitter is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 32. ------------------ is an excellent 2D and 3D graphics library for generating
scientific figures?

A : Pandas
B : Numpy

C : matplotlib

D : ndarray

Q.no 33. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))
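
Support, and the related confidence measure, can be computed directly from a transaction list. The sketch below is plain Python; the `transactions` data and the helper names `support` and `confidence` are assumptions made for illustration.

```python
# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the rule antecedent -> consequent holds when the antecedent occurs."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


support_bread = support({"bread"}, transactions)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)

print(support_bread, conf_bread_milk)
```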

Q.no 34. ------------ is an example of semi structured data

A : NoSQL data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 35. --------------------- is raster graphic format with lossless compression.

A : EPS

B : PDF

C : PNG

D : PS

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 37. --------------------- is a form of supervised learning algorithm which is used by
mail service providers like Gmail, Yahoo, etc. to classify a new mail as spam or
not spam.

A : Classification

B : Regression

C : Clustering

D : Naïve Bayes

Q.no 38. In ------------ the x-axes are grouped into bins and each bin will be treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 39. When data are collected in a statistical study for only a portion or subset
of all elements of interest we are using

A : Sample

B : Parameter

C : Population

D : Probability

Q.no 40. ------------- regression finds a relationship between one or more features
(independent variables) and a continuous variable (dependent variable).

A : Non-linear

B : Linear

C : Both of these

D : None of These

Q.no 41. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence
D : lift
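
A minimal sketch of Shannon entropy as a measure of uncertainty in a set of class labels. The helper name `entropy` and the label lists are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: minus the sum of p * log2(p) over the classes."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical label sets, for illustration only.
certain = entropy(["a", "a", "a", "a"])     # one class, no uncertainty
uncertain = entropy(["a", "a", "b", "b"])   # two equally likely classes

print(certain, uncertain)
```

An even two-class split gives the maximum of one bit, while a single-class set gives zero.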

Q.no 42. Which of the following functions is not used to iterate over the rows of the
DataFrame.

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 43. --------- is a technique that virtually duplicates the smaller array so that its
dimensionality and size match those of the larger array.

A : Multiplication

B : Broadcasting

C : Addition

D : Flatten
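
A short sketch of broadcasting between a 2-D array and a 1-D array. The arrays `matrix` and `row` are assumed values for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])                # shape (3,)

# The 1-D row is treated as if it were repeated along axis 0 to match
# shape (2, 3); NumPy does this without actually copying the data.
result = matrix + row

print(result.shape)
print(result)
```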

Q.no 44. Which of the following tasks is not performed by a Data Scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff Recruitment

Q.no 45. To save a figure into a file we can use the ------------ method of the Figure class
in matplotlib.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 46. ---------- machine learning algorithm is used in cross marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree
B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 47. Which function returns an ndarray object that contains the numbers that
are evenly spaced on a log scale.

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()
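
For comparison, a minimal sketch of log-spaced versus linearly spaced samples. The start, stop and `num` values are assumptions chosen for illustration.

```python
import numpy as np

# Four points whose exponents (base 10) run evenly from 0 to 3.
log_points = np.logspace(0, 3, num=4)

# Four points spaced evenly on a linear scale over the same range.
lin_points = np.linspace(1, 1000, num=4)

print(log_points)
print(lin_points)
```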

Q.no 48. The --------- argument of the merge function, while merging two dataframes,
specifies which keys are to be included in the resulting dataframe.

A : right

B : on

C : sort

D : how
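
The effect of the join type on which keys survive a merge can be sketched as follows. The two small DataFrames and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical tables sharing an "id" key column.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [80, 90, 70]})

inner = pd.merge(left, right, on="id", how="inner")      # keys present in both
outer = pd.merge(left, right, on="id", how="outer")      # keys present in either
left_join = pd.merge(left, right, on="id", how="left")   # all keys from `left`

print(inner)
print(outer)
print(left_join)
```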

Q.no 49. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 50. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()
Q.no 51. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 52. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 53. Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate item sets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 54. In Data science project data acquisition step involves----------------

A : Acquiring data from various sources.

B : Selecting dataset

C : Data preprocessing

D : Data modeling

Q.no 55. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 56. Which of the following statements will create an Axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 58. While plotting using matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)
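
Matplotlib numbers subplot positions row by row, starting at 1 in the upper-left corner. The sketch below shows that arithmetic in plain Python; `subplot_position` is a hypothetical helper written for this example and is not part of matplotlib.

```python
def subplot_position(nrows, ncols, index):
    """Return the 0-based (row, col) of a 1-based, row-major subplot index."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must lie between 1 and nrows * ncols")
    return divmod(index - 1, ncols)


top_right = subplot_position(2, 3, 3)     # last column of the first row
bottom_left = subplot_position(2, 3, 4)   # first column of the second row

print(top_right, bottom_left)
```

The three-digit shorthand such as `subplot(234)` encodes the same three numbers (nrows, ncols, index), which is why it only works when each of them is a single digit.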

A : Regression

B : Decision Trees

C : Clustering

D : Naïve bays
Q.no 60. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - Bottom

B : Bottom - to - Top

C : Left - to - Right

D : Right - to - Left
Answer for Question No 1. is a

Answer for Question No 2. is c

Answer for Question No 3. is c

Answer for Question No 4. is d

Answer for Question No 5. is c

Answer for Question No 6. is c

Answer for Question No 7. is a

Answer for Question No 8. is c

Answer for Question No 9. is b

Answer for Question No 10. is c

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is b

Answer for Question No 15. is d

Answer for Question No 16. is b

Answer for Question No 17. is b

Answer for Question No 18. is c

Answer for Question No 19. is a

Answer for Question No 20. is b

Answer for Question No 21. is b

Answer for Question No 22. is a

Answer for Question No 23. is d

Answer for Question No 24. is d

Answer for Question No 25. is c

Answer for Question No 26. is b

Answer for Question No 27. is a

Answer for Question No 28. is c

Answer for Question No 29. is a

Answer for Question No 30. is c

Answer for Question No 31. is b

Answer for Question No 32. is c

Answer for Question No 33. is a

Answer for Question No 34. is a

Answer for Question No 35. is c

Answer for Question No 36. is a

Answer for Question No 37. is a

Answer for Question No 38. is d

Answer for Question No 39. is a

Answer for Question No 40. is b

Answer for Question No 41. is a

Answer for Question No 42. is d

Answer for Question No 43. is b

Answer for Question No 44. is d

Answer for Question No 45. is b

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is d

Answer for Question No 49. is d

Answer for Question No 50. is c

Answer for Question No 51. is a

Answer for Question No 52. is a

Answer for Question No 53. is b

Answer for Question No 54. is a

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is a

Answer for Question No 59. is b

Answer for Question No 60. is a

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. In statistics, a population consists of -------------------

A : All people living in a country.

B : All people living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of the whole dataset

Q.no 2. ----------- is data that depends on a data model and resides in a fixed field within
a record.

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered
Q.no 3. A ---------- plot displays information as a series of data points connected by
straight lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 4. ---------------- is about developing code to enable the machine to learn to
perform tasks; its basic principle is the automatic modeling of the underlying processes that
have generated the collected data.

A : Data Science

B : Data Analytics

C : Data Warehousing

D : Data mining

Q.no 5. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()
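
A quick way to check this question is to build the arrays directly. The sketch below assumes NumPy is installed; the shape (2, 3) is chosen only for illustration.

```python
import numpy as np

# numpy.ones takes the desired shape; a 2-tuple gives a 2-D array of ones.
ones_2d = np.ones((2, 3))

# For comparison, two of the other options build different arrays.
zeros_2d = np.zeros((2, 3))  # every element is 0
eye_3 = np.eye(3)            # ones only on the main diagonal

print(ones_2d)
```

Note that numpy.empty() allocates an array without initialising its values, so its contents are not predictable.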

Q.no 6. The ---------------- method of a DataFrame returns the first n rows of the DataFrame.

A : head(n)

B : tail(n)

C : first(n)

D : start(n)
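
The difference between head(n) and tail(n) can be seen on a small table. This is a minimal sketch assuming pandas is installed; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical marks table used only to demonstrate row selection.
df = pd.DataFrame({"roll_no": [1, 2, 3, 4, 5],
                   "marks": [72, 85, 64, 90, 58]})

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```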

Q.no 7. NumPy provides this function to compute the trigonometric sine element-wise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()
D : numpy.rad2sin(x1)

Q.no 8. The Apriori algorithm is a/an --------------- machine learning algorithm.

A : Unsupervised

B : Supervised

C : Both of these

D : None of These

Q.no 9. Which Python library is used for implementing machine learning
algorithms?

A : Scikit-Learn

B : Pandas

C : Matplotlib

D : Numpy

Q.no 10. The ----------- algorithm is based on the fact that it uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 11. Which of the following is not a raster image file format?

A : PNG

B : JPG

C : BMP

D : PDF

Q.no 12. The K-nearest neighbors algorithm is based on -------------- learning

A : Unsupervised

B : Supervised
C : Association

D : correlation

Q.no 13. ----------------- is an example of human-generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 14. Which of the following is NOT supervised learning?

A : PCA

B : Decision Tree

C : Linear Regression

D : Naive Bayesian

Q.no 15. ----------- is a supervised machine learning algorithm that outputs an optimal
hyperplane for given labeled training data.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 16. ------------ rule mining is a technique to identify underlying relations
between different items.

A : Classification

B : Regression

C : Clustering

D : Association

Q.no 17. The ------------- type of analytics describes what happened in the past.

A : Descriptive
B : Prescriptive

C : Predictive

D : Probability

Q.no 18. The -------- function is used to add a title to each Axes instance in a figure.

A : set_title()

B : get_title()

C : set_label()

D : title()

Q.no 19. Which function is used to give a title to the axes?

A : plt.title()

B : plt.xlabel()

C : plt.ylabel()

D : plt.xscale()

Q.no 20. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 21. In a ------------, the x-axis values are grouped into bins and each bin is treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 22. ------- is the basic data structure of pandas; it can be thought of as an SQL table or a
spreadsheet data representation.

A : DataFrame

B : Series

C : list

D : ndarray

Q.no 23. The ------------------ module of matplotlib is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 24. A perfect negative correlation is signified by -------------

A : 1

B : -1

C : 0

D : 2
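
The value of the correlation coefficient for a perfectly negative linear relationship can be verified numerically. A minimal sketch, assuming NumPy is installed; the data points are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# y falls by exactly 2 for every unit increase in x,
# so the points lie on a straight line with negative slope.
y_neg = -2.0 * x + 10.0

# corrcoef returns the 2 x 2 correlation matrix; [0, 1] is r(x, y).
r = np.corrcoef(x, y_neg)[0, 1]
print(round(r, 6))
```

The Pearson coefficient is bounded by -1 and +1, so 2 can never be a valid value, and 0 would indicate no linear relationship.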

Q.no 25. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These
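
The three measures named in the options can be computed by hand on a toy basket dataset. The sketch below uses only the standard library; the transactions and the rule {bread} -> {milk} are hypothetical.

```python
# Four hypothetical shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(itemset <= basket for basket in transactions)
    return hits / len(transactions)

# Rule: {bread} -> {milk}
support_rule = support({"bread", "milk"})        # how often both occur together
confidence = support_rule / support({"bread"})   # how often the rule holds when bread is bought
lift = confidence / support({"milk"})            # confidence relative to milk's base rate

print(support_rule, confidence, lift)
```

Confidence is the measure that tells how often the rule has been found to be true among the transactions that contain its left-hand side.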

Q.no 26. In the matplotlib library, the ------------- module supports basic image loading,
rescaling and display operations.

A : picture

B : image

C : pyplot

D : sympy
Q.no 27. The --------- function from the matplotlib.pyplot library plots a bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 28. ---------- is an unsupervised technique aiming to divide a multivariate dataset
into clusters or groups.

A : KNN

B : Support Vector Machines

C : Regression

D : Cluster analysis

Q.no 29. When data are collected in a statistical study for only a portion or subset
of all elements of interest, we are using a ----------

A : Sample

B : Parameter

C : Population

D : Probability

Q.no 30. -------- is the most important language for Data Science.

A : Java

B : Ruby

C : R

D : None of these

Q.no 31. The last element of an ndarray is indexed by -------------

A : 0

B : -1

C : 1
D : -2
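
Negative indexing counts from the end of the array. A minimal sketch, assuming NumPy is installed; the array values are arbitrary.

```python
import numpy as np

a = np.array([10, 20, 30, 40])

first = a[0]         # indices start at 0
last = a[-1]         # -1 refers to the last element
second_last = a[-2]  # -2 refers to the element before it

print(first, last, second_last)
```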

Q.no 32. The number of iterations in Apriori ---------------

A : increases with the size of the data

B : decreases with the increase in size of the data

C : increases with the size of the maximum frequent set

D : decreases with increase in size of the maximum frequent set

Q.no 33. Which of the following is used as an attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 34. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : Software testing

Q.no 35. What is the correct syntax to generate integers from 10 to 30 (inclusive)?

A : x=numpy.arange(10,30)

B : x=numpy.array(10,30)

C : x=numpy.arange(10,31)

D : x=arange(10,31)
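
The stop value of numpy.arange is exclusive, which is what separates the options here. A minimal sketch, assuming NumPy is installed:

```python
import numpy as np

# The stop value is excluded, so 31 is needed to include 30.
x = np.arange(10, 31)

# With stop=30 the sequence ends at 29.
y = np.arange(10, 30)

print(x[0], x[-1], len(x))
print(y[0], y[-1], len(y))
```

The bare call arange(10,31) would only work if the name had been imported explicitly, for example with "from numpy import arange".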

Q.no 36. ------------- is an unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees
D : Cluster analysis

Q.no 37. --------------- searches for the linear optimal separating hyperplane for
separation of the data using essential training tuples called support vectors

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 38. ------------------- is a one-dimensional array defined in pandas that can be
used to store any data type.

A : Dict

B : Series

C : ndarray

D : list

Q.no 39. To read an image from a file into an array, the --------------- function is used.

A : matplotlib.pyplot.imshow()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 40. JSON file data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 41. In regression, the independent variable is also called the ------------

A : Regressor

B : Continuous
C : Regressand

D : Estimated

Q.no 42. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 43. To reach the final point and make a prediction, a decision tree must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

Q.no 44. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff recruitment

Q.no 45. Determining the basic salary of an employee when the employee's qualification is given is
a ----------- problem

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 46. Which NumPy function returns the truncated value of the
input, element-wise?
A : round()

B : trunc()

C : del()

D : remove_decimal()
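
Truncation and rounding differ for values whose fractional part is above one half. A minimal sketch, assuming NumPy is installed; the input values are arbitrary.

```python
import numpy as np

values = np.array([-1.7, -0.2, 0.2, 1.7])

# trunc drops the fractional part, moving each value toward zero.
truncated = np.trunc(values)

# round moves each value to the nearest integer instead.
rounded = np.round(values)

print(truncated)
print(rounded)
```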

Q.no 47. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 48. When plotting with matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is ----------

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)
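
The three-digit shorthand can be compared with the three-argument form by checking where each call places its axes. This is a sketch assuming matplotlib is installed; it uses Figure.add_subplot, which accepts the same position arguments as pyplot.subplot, and the non-interactive Agg backend so that no window is opened.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

fig = plt.figure()

# Three separate integers: nrows=2, ncols=3, index=4.
ax_long = fig.add_subplot(2, 3, 4)

# One three-digit integer is read digit by digit as nrows, ncols, index.
ax_short = fig.add_subplot(234)

# Both forms describe the same cell of the 2 x 3 grid.
same_position = ax_long.get_position().bounds == ax_short.get_position().bounds
print(same_position)
```

The shorthand only works when all three numbers are single digits.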

Q.no 49. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 50. ---------- is a measure of disorder, impurity, unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : Lift
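
Shannon entropy, the usual impurity measure in decision tree learning, can be computed in a few lines. The sketch below uses only the standard library; the label lists are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

pure = entropy(["yes", "yes", "yes", "yes"])   # one class only: no uncertainty
mixed = entropy(["yes", "yes", "no", "no"])    # 50/50 split: maximum uncertainty for two classes

print(pure, mixed)
```

A split that lowers entropy the most has the highest information gain.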
Q.no 51. The strength (degree) of the correlation between a set of independent
variables X and a dependent variable Y is measured by-------------

A : Coefficient of Correlation

B : Coefficient of Determination

C : Standard error of estimate

D : Probability

Q.no 52. To save a figure into a file we can use the ------------ method of the Figure class
in matplotlib.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 53. When an increase or decrease in one variable has no impact on the
other variable, there is ------------

A : Perfect correlation

B : No Correlation

C : Positive Correlation

D : Negative Correlation

Q.no 54. In matplotlib, -------------- is the container class for the Figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 55. The plot_number parameter of the subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows
D : ncols

Q.no 56. Which of the following statements will create an axes at the top-right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)
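
Subplot indices run left to right along the first row and then continue on the next row, so the location of each index can be checked directly. A sketch assuming matplotlib is installed; it uses Figure.add_subplot (same position arguments as pyplot.subplot) and the Agg backend to avoid opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

fig = plt.figure()

# Create all six cells of a 2 x 3 grid; indices 1-3 form the top row.
axes = [fig.add_subplot(2, 3, k) for k in range(1, 7)]
positions = [ax.get_position() for ax in axes]

# Index 3 is the last cell of the top row.
top_right = positions[2]
is_rightmost = top_right.x0 == max(p.x0 for p in positions)
is_topmost = top_right.y0 == max(p.y0 for p in positions)
print(is_rightmost, is_topmost)
```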

Q.no 57. ---------- is a machine learning algorithm used in cross-marketing to work with
other businesses that complement your own business, not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 58. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 59. To compute summary statistics such as count, mean, standard deviation, min and
max for each numerical column of a DataFrame, the ---------- function is used.

A : display()

B : head()

C : describe()

D : sort()
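
The summary produced for numeric columns can be inspected on a small table. A minimal sketch, assuming pandas is installed; the column names and values are made up.

```python
import pandas as pd

# One numeric column and one text column, for illustration only.
df = pd.DataFrame({"marks": [50, 60, 70, 80],
                   "name": ["a", "b", "c", "d"]})

# By default describe() summarises only the numeric columns.
summary = df.describe()

print(summary)
```

The resulting index holds count, mean, std, min, the quartiles and max.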

Q.no 60. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability
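
The link between area under the bell curve and probability can be illustrated with the cumulative distribution function. The sketch below uses only the standard library; normal_cdf is a helper written for this example, not a library function.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Probability that a standard normal value falls within one
# standard deviation of the mean: area between -1 and +1.
within_one_sigma = normal_cdf(1.0) - normal_cdf(-1.0)

print(round(within_one_sigma, 4))
```

This reproduces the familiar rule that roughly 68% of values lie within one standard deviation of the mean.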
Answer for Question No 1. is c

Answer for Question No 2. is a

Answer for Question No 3. is b

Answer for Question No 4. is b

Answer for Question No 5. is a

Answer for Question No 6. is a

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is a

Answer for Question No 10. is d

Answer for Question No 11. is d

Answer for Question No 12. is b

Answer for Question No 13. is a

Answer for Question No 14. is a

Answer for Question No 15. is b

Answer for Question No 16. is d

Answer for Question No 17. is a

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is b

Answer for Question No 25. is a

Answer for Question No 26. is b

Answer for Question No 27. is c

Answer for Question No 28. is d

Answer for Question No 29. is a

Answer for Question No 30. is c

Answer for Question No 31. is b

Answer for Question No 32. is c

Answer for Question No 33. is a

Answer for Question No 34. is d

Answer for Question No 35. is c

Answer for Question No 36. is d

Answer for Question No 37. is d

Answer for Question No 38. is b

Answer for Question No 39. is b

Answer for Question No 40. is c

Answer for Question No 41. is a

Answer for Question No 42. is c

Answer for Question No 43. is a

Answer for Question No 44. is d

Answer for Question No 45. is b

Answer for Question No 46. is b

Answer for Question No 47. is b

Answer for Question No 48. is a

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is b

Answer for Question No 53. is b

Answer for Question No 54. is d

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is b

Answer for Question No 59. is c

Answer for Question No 60. is a

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 2. ------------ means the part of the population chosen for participation in the study

A : Population

B : Sample

C : Association

D : Correlation
Q.no 3. Choose the correct option for machine-generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Sensor data

Q.no 4. To save or write DataFrame data into a CSV file, the -------- function is used

A : write_csv()

B : write_file()

C : csv_read()

D : to_csv()
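
Writing a DataFrame as CSV and reading it back shows the round trip. This sketch assumes pandas is installed and writes to an in-memory buffer instead of a file on disk; the table contents are hypothetical.

```python
import io
import pandas as pd

df = pd.DataFrame({"roll_no": [1, 2], "marks": [72, 85]})

# to_csv accepts a path or any file-like object; index=False omits row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# read_csv is the matching reader.
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
```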

Q.no 5. ------------ uses a tree structure to specify sequences of decisions and
consequences.

A : Regression

B : Decision trees

C : KNN

D : SVM

Q.no 6. ---------------- is about developing code to enable the machine to learn to
perform tasks; its basic principle is the automatic modeling of the underlying processes that
have generated the collected data.

A : Data Science

B : Data Analytics

C : Data Warehousing

D : Data mining

Q.no 7. NumPy provides this function to compute the trigonometric sine element-wise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()
D : numpy.rad2sin(x1)

Q.no 8. The ------------- type of analytics describes what happened in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 9. The ----------- algorithm is based on the fact that it uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 10. A satellite image is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 11. Unsupervised learning makes sense of ------------- data without having any
predefined dataset for its training.

A : Unlabeled

B : Labeled

C : Semi-labeled

D : Empty dataset

Q.no 12. Correlation coefficient values lie between ----- and ---

A : -1 and +1

B : -1 and 0

C : 0 and 1

D : 0 and infinity

Q.no 13. The K-nearest neighbors algorithm is based on -------------- learning

A : Unsupervised

B : Supervised

C : Association

D : correlation

Q.no 14. ------ analytics answers questions like "How can we make it happen?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 15. ------------ plots show all individual data points without connecting them
with lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 16. A ------------ chart is a circular plot divided into slices to show numerical
proportions.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 17. Which of the following is a measure used in decision trees to select the
splitting criterion that partitions the data in the best possible manner?

A : Information Gain
B : Probability

C : Regression

D : Association

Q.no 18. ----------------- is an example of human-generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 19. -------------- charts represent categorical data with rectangular bars

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 20. In correlation both values are always---------------

A : Random

B : sequential

C : Same

D : from same group

Q.no 21. To rotate an image, the -------- function from the SciPy library is used.

A : rotation()

B : scipy.move()

C : scipy.ndimage.rotate()

D : scipy.flip()

Q.no 22. ---------- is one of the most widely used machine learning
algorithms; much of its popularity is because it can be adapted to almost any type
of data.
A : Clustering

B : Regression

C : Decision trees

D : Apriori

Q.no 23. ------ is a classification technique that relies on the naïve assumption that
input variables are independent of each other.

A : KNN

B : Naïve Bayes

C : Regression

D : Support vector machine

Q.no 24. ----------- phase of the data analytics lifecycle usually takes the longest
time.

A : Data Preparation

B : Model Planning

C : Model Building

D : Communicate Results

Q.no 25. ------------------ is an excellent 2D and 3D graphics library for generating
scientific figures.

A : Pandas

B : Numpy

C : matplotlib

D : ndarray

Q.no 26. -------- is the most important language for Data Science.

A : Java

B : Ruby

C : R

D : None of these
Q.no 27. Which statement will create a 5 x 5 array filled with ones?

A : x=numpy.ones((5,5))

B : x=numpy.ones(5)

C : x=numpy.zeros((5,5))

D : x=numpy.eye((5,5))

Q.no 28. Which function returns the n x n identity array, with its
main diagonal set to ones and all other elements set to zero?

A : numpy.ones()

B : numpy.zeros()

C : numpy.fill()

D : numpy.identity()
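
The identity array is easy to inspect for a small n. A minimal sketch, assuming NumPy is installed; n = 3 is chosen only for illustration.

```python
import numpy as np

# identity(n) returns an n x n array with ones on the main diagonal.
identity_3 = np.identity(3)

# eye(n) gives the same result for a square array with the default offset.
matches_eye = np.array_equal(identity_3, np.eye(3))

print(identity_3)
print(matches_eye)
```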

Q.no 29. The ------------------ module of matplotlib is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 30. In this type of clustering, each data point either belongs to a cluster
completely or not.

A : Hard clustering

B : Soft clustering

C : Medium clustering

D : Simple clustering

Q.no 31. The ---------- function is used to add two NumPy arrays element-wise.

A : numpy.add(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)
D : numpy.addition(x1,x2)

Q.no 32. A ----------------- graph is a circular plot, divided into slices to show numerical
proportions.

A : Bar

B : Scatter

C : pie

D : line

Q.no 33. The --------- function from the matplotlib.pyplot library plots a bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 34. If a=np.array([1,2,3,4,5,6,7,8,9,10]) then a[2:5:1] will produce the output ----------

A : 3, 4, 5

B : 3,4,5,6

C : 2,3,4,5

D : 1,2,3,4,5
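
The listed options only make sense for a slice, so the sketch below assumes the intended expression is a[2:5:1] (start 2, stop 5, step 1); written with commas, a[2,5,1] would request three dimensions and fail on a 1-D array. NumPy is assumed to be installed.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# start=2, stop=5 (exclusive), step=1 selects indices 2, 3 and 4.
sliced = a[2:5:1]

# Comma-separated indices address separate dimensions,
# which a one-dimensional array does not have.
try:
    a[2, 5, 1]
    comma_form_works = True
except IndexError:
    comma_form_works = False

print(sliced, comma_form_works)
```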

Q.no 35. Identify the machine-generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 36. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : Software testing

Q.no 37. --------------------- is a raster graphics format with lossless compression.

A : EPS

B : PDF

C : PNG

D : PS

Q.no 38. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 39. Regression analysis -----------

A : Establishes a relationship between two variables

B : Establishes cause and effect

C : Measures growth

D : Measures demand for good

Q.no 40. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These

Q.no 41. When merging two DataFrames, the --------- argument of the merge function
specifies which keys are to be included in the resulting DataFrame.

A : right
B : on

C : sort

D : how
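
The effect of the join type on the keys that survive a merge can be seen with two small tables. A minimal sketch, assuming pandas is installed; the frames and column names are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# 'on' names the join column; 'how' decides which keys are kept.
inner = pd.merge(left, right, on="key", how="inner")  # keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # keys present in either
left_join = pd.merge(left, right, on="key", how="left")  # all keys from 'left'

print(inner)
print(outer)
print(left_join)
```

Rows without a match on the other side are filled with NaN in the outer and left joins.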

Q.no 42. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff recruitment

Q.no 43. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 44. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 45. When plotting with matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is ----------

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

Q.no 46. Which of the following statements will create an axes at the top-right
corner of the current figure?
A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 47. The --------- function applies custom operations to the entire DataFrame.

A : function()

B : subroutine()

C : routine()

D : pipe()
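
Table-wise function application can be sketched as follows, assuming pandas is installed. The helper add_bonus and the data are invented for this example.

```python
import pandas as pd

def add_bonus(frame, bonus):
    """Hypothetical custom operation: add a fixed bonus to every value."""
    return frame + bonus

df = pd.DataFrame({"maths": [40, 55], "physics": [35, 60]})

# pipe passes the whole DataFrame as the first argument of the function,
# followed by any extra arguments.
boosted = df.pipe(add_bonus, 5)

print(boosted)
```

The call df.pipe(add_bonus, 5) is equivalent to add_bonus(df, 5), but it chains more readably when several steps are applied in sequence.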

Q.no 48. ---------- is a measure of disorder, impurity, unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : Lift

Q.no 49. Which of the following algorithms is used in economics, finance, biology,
etc. to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 50. The statement subplot(4,3,5) will divide the figure into ------- and specify
that plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4
Q.no 51. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 52. --------------- is basically extracting particular set of elements from an array.

A : Slicing

B : indexing

C : sorting

D : broadcasting

Q.no 53. In regression, the dependent variable is also called the ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 54. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 55. The plot_number parameter of the subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows
D : ncols

Q.no 56. To reach the final point and make a prediction, a decision tree must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

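Q.no 57. ---------- is a machine learning algorithm used in cross-marketing to work with
other businesses that complement your own business, not with competitors.

A : Decision tree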

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 58. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 59. To compute summary statistics such as count, mean, standard deviation, min and
max for each numerical column of a DataFrame, the ---------- function is used.

A : display()

B : head()

C : describe()

D : sort()
Q.no 60. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()
Answer for Question No 1. is a

Answer for Question No 2. is b

Answer for Question No 3. is d

Answer for Question No 4. is d

Answer for Question No 5. is b

Answer for Question No 6. is b

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is d

Answer for Question No 10. is b

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is b

Answer for Question No 15. is c

Answer for Question No 16. is d

Answer for Question No 17. is a

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is c

Answer for Question No 22. is c

Answer for Question No 23. is b

Answer for Question No 24. is a

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is a

Answer for Question No 28. is d

Answer for Question No 29. is b

Answer for Question No 30. is a

Answer for Question No 31. is a

Answer for Question No 32. is c

Answer for Question No 33. is c

Answer for Question No 34. is a

Answer for Question No 35. is d

Answer for Question No 36. is d

Answer for Question No 37. is c

Answer for Question No 38. is d

Answer for Question No 39. is a

Answer for Question No 40. is a

Answer for Question No 41. is d

Answer for Question No 42. is d

Answer for Question No 43. is a

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is a

Answer for Question No 47. is d

Answer for Question No 48. is a

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is a

Answer for Question No 53. is c

Answer for Question No 54. is c

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is b

Answer for Question No 59. is c

Answer for Question No 60. is d

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. The Apriori algorithm is a/an --------------- machine learning algorithm.

A : Unsupervised

B : Supervised

C : Both of these

D : None of These

Q.no 2. CCTV footage is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 3. Choose the correct option for machine-generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Sensor data

Q.no 4. Pin code of a city is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 5. The leaf nodes in a decision tree return the ---------

A : decision condition

B : class labels

C : decision on variables

D : test score

Q.no 6. ---------- provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python

A : Pandas

B : Numpy

C : Scikit-Learn

D : image

Q.no 7. To import data from an Excel file into a DataFrame, the ---------- function is
provided by the pandas package.

A : read_csv()

B : read_file()

C : read()

D : read_excel()
Q.no 8. The ---------- function is used to get the positive square root of a NumPy array
element-wise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.find(x1,2)

Q.no 9. The ------------- function reads an image from a file as an array.

A : imsave()

B : imread()

C : read()

D : None of these

Q.no 10. NumPy provides this function to compute the trigonometric sine element-wise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()

D : numpy.rad2sin(x1)

Q.no 11. In statistics, a population consists of -------------------

A : All people living in a country.

B : All people living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of the whole dataset

Q.no 12. In a NumPy array, array indices always start from --------

A : 1

B : -1

C : 0

D : 2
Q.no 13. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 14. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 15. ------------ rule mining is a technique to identify underlying relations
between different items.

A : Classification

B : Regression

C : Clustering

D : Association

Q.no 16. ------------ means the part of the population chosen for participation in the study

A : Population

B : Sample

C : Association

D : Correlation

Q.no 17. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 18. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 19. Which of the following is not a type of clustering algorithm?

A : Density clustering

B : K-Means clustering

C : Centroid clustering

D : Simple clustering

Q.no 20. A ---------- plot displays information as a series of data points connected by
straight lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 21. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 22. ------------ is an example of semi-structured data

A : NoSQL data

B : YouTube data
C : Text File data

D : Satellite imagery data

Q.no 23. Which of the following is used as an attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 24. A ----------------- graph is a circular plot, divided into slices to show numerical
proportions.

A : Bar

B : Scatter

C : pie

D : line

Q.no 25. --------------- searches for the linear optimal separating hyperplane for
separation of the data using essential training tuples called support vectors

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 26. ------------ is the step performed by a data scientist after acquiring the data.

A : Data Cleansing

B : Data Integration

C : Data Replication

D : Data loading

Q.no 27. Which function returns the n x n identity array, with its
main diagonal set to ones and all other elements set to zero?

A : numpy.ones()

B : numpy.zeros()

C : numpy.fill()

D : numpy.identity()

Q.no 28. The --------- function from the matplotlib.pyplot library plots a bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 29. ------------------ is an excellent 2D and 3D graphics library for generating
scientific figures.

A : Pandas

B : Numpy

C : matplotlib

D : ndarray

Q.no 30. The process by which we estimate the value of a dependent variable on the
basis of one or more independent variables is called -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 31. ---------- are among the most widely used machine learning algorithms; much of
their popularity comes from the fact that they can be adapted to almost any type
of data.

A : Clustering

B : Regression

C : Decision trees

D : Apriori

Q.no 32. The slope of the regression line of Y on X is also called the -----------

A : Correlation coefficient

B : Regression coefficient

C : Association coefficient

D : Probability
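
The slope of the least-squares line of Y on X can be computed by hand. This is a minimal pure-Python sketch; the helper name regression_slope and the sample data are made up for illustration.

```python
def regression_slope(xs, ys):
    """Least-squares slope of y on x: sum of cross-deviations / sum of squared x-deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]          # generated from y = 2x + 1
slope = regression_slope(xs, ys)
intercept = sum(ys) / len(ys) - slope * sum(xs) / len(xs)
print(slope, intercept)
```

Because the sample points lie exactly on y = 2x + 1, the fitted slope and intercept recover 2 and 1.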

Q.no 33. -------- is the measure of the likelihood that an event will occur in a
random experiment.

A : Probability

B : Correlation

C : Regression

D : Sample

Q.no 34. What does the following function call do? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to figure

D : Add text to figure

Q.no 35. ----------- analysis finds the reasons behind success or failure in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 36. Pandas provides the ----------- function as the entry point for all standard
database join operations when merging two DataFrame objects.

A : concat()

B : replace()

C : merge()

D : add()

Q.no 37. JSON file data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 38. Broadcasting is a powerful technique that allows numpy to work with
arrays of ------------- .

A : Same Shapes

B : Different Shapes

C : Same values

D : Different values
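
Broadcasting can be seen by adding arrays whose shapes differ. This is a small sketch, assuming NumPy is installed; the arrays are invented for illustration.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])      # shape (2, 3)
row = np.array([10, 20, 30])        # shape (3,)

# The 1-D row is virtually stretched along the first axis to match shape (2, 3).
result = matrix + row
print(result)
```

NumPy does not physically copy the smaller array; it only behaves as if the row had been repeated for each row of the matrix.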

Q.no 39. If a scatter diagram is drawn and all the points lie on a straight line,
it indicates -------

A : No correlation

B : Perfect correlation

C : Regression

D : Skewness
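
The link between collinear points and the correlation coefficient can be checked numerically. This is a pure-Python sketch; pearson_r is a hypothetical helper and the data are invented.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r_rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # points on a rising line
r_falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])   # points on a falling line
print(r_rising, r_falling)
```

A rising line gives a coefficient of +1 and a falling line gives -1; both are perfect correlations.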

Q.no 40. -------------- models search the data space for areas of varying density of data
points.

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 42. In matplotlib, -------------- is the container class for a figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 43. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 44. While plotting with matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

A : Decision tree

B : Association Rule Mining

C : Clustering
D : Support vector machine

Q.no 46. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - bottom

B : Bottom - to - Top

C : Left - to - Right

D : Right - to - Left

Q.no 47. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 48. --------- function performs the custom operations for the entire dataframe.

A : function()

B : subroutine()

C : routine()

D : pipe()

Q.no 49. To test the accuracy of a machine learning algorithm, the whole data set
should be divided into training and testing datasets. Which of the following is
a good proportion for the train-test split?

A : Train- 70%, Test - 30%

B : Train- 50%, Test - 50%

C : Train- 30%, Test - 70%

D : Train- 100%, Test - 00%
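
A 70/30 split can be sketched with the standard library alone. The helper split_rows below is hypothetical (scikit-learn offers its own splitting utility); the seed and data are arbitrary.

```python
import random

def split_rows(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows, then hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = split_rows(range(10))
print(len(train_rows), len(test_rows))
```

Shuffling before splitting keeps any ordering in the source data from leaking into one of the two subsets.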

Q.no 50. Which numpy function is used to return the truncated value of the
input elementwise?

A : round()

B : trunc()

C : del()

D : remove_decimal()
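
The difference between truncating and rounding shows up on values with a large fractional part. This is a brief sketch, assuming NumPy is installed; the sample values are arbitrary.

```python
import numpy as np

values = np.array([-1.7, 1.7])
truncated = np.trunc(values)   # drops the fractional part, moving toward zero
rounded = np.round(values)     # moves to the nearest integer instead

print(truncated, rounded)
```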

Q.no 51. When an increase or decrease in one variable has no impact on the
other variable, there is ------------

A : Perfect correlation

B : No Correlation

C : Positive Correlation

D : Negative Correlation

Q.no 52. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 53. --------- is a technique that duplicates the smaller array so that its
dimensionality and size match those of the larger array.

A : Multiplication

B : Broadcasting

C : Addition

D : Flatten

Q.no 54. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate item sets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 55. The statement subplot(4,3,5) will divide the figure into a ------- grid and specify
that plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4
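
The three arguments of subplot(nrows, ncols, index) can be decoded without drawing anything. The sketch below uses a hypothetical helper, subplot_position, which is not part of matplotlib; it only mirrors the row-by-row, 1-based numbering that subplot uses.

```python
def subplot_position(nrows, ncols, index):
    """Zero-based (row, col) of a 1-based subplot index, counted left to right, top to bottom."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must lie between 1 and nrows * ncols")
    return divmod(index - 1, ncols)

print(subplot_position(4, 3, 5))   # 4 x 3 grid, plot number 5
print(subplot_position(2, 3, 3))   # 2 x 3 grid, last cell of the first row
```

In a 4 x 3 grid, plot number 5 falls in the second row, second column; in a 2 x 3 grid, plot number 3 is the top-right cell.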

Q.no 56. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff Recruitment

Q.no 57. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 58. Which function returns an ndarray object that contains numbers that
are evenly spaced on a log scale?

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()
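
Log-scale spacing means the exponents, not the values, are evenly spaced. This is a minimal sketch, assuming NumPy is installed; the endpoints are chosen only for illustration.

```python
import numpy as np

# Four points whose base-10 exponents run evenly from 0 to 3.
points = np.logspace(0, 3, num=4)
print(points)
```

The result holds 10**0, 10**1, 10**2 and 10**3; the default base is 10 and can be changed with the base argument.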

Q.no 59. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 60. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Answer for Question No 1. is a

Answer for Question No 2. is b

Answer for Question No 3. is d

Answer for Question No 4. is a

Answer for Question No 5. is b

Answer for Question No 6. is c

Answer for Question No 7. is d

Answer for Question No 8. is a

Answer for Question No 9. is b

Answer for Question No 10. is a

Answer for Question No 11. is c

Answer for Question No 12. is c

Answer for Question No 13. is a

Answer for Question No 14. is a

Answer for Question No 15. is d

Answer for Question No 16. is b

Answer for Question No 17. is b

Answer for Question No 18. is a

Answer for Question No 19. is d

Answer for Question No 20. is b

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is a

Answer for Question No 24. is c

Answer for Question No 25. is d

Answer for Question No 26. is a

Answer for Question No 27. is d

Answer for Question No 28. is c

Answer for Question No 29. is c

Answer for Question No 30. is b

Answer for Question No 31. is c

Answer for Question No 32. is b

Answer for Question No 33. is a

Answer for Question No 34. is a

Answer for Question No 35. is a

Answer for Question No 36. is c

Answer for Question No 37. is c

Answer for Question No 38. is b

Answer for Question No 39. is b

Answer for Question No 40. is d

Answer for Question No 41. is b

Answer for Question No 42. is d

Answer for Question No 43. is a

Answer for Question No 44. is a

Answer for Question No 45. is b

Answer for Question No 46. is a

Answer for Question No 47. is c

Answer for Question No 48. is d

Answer for Question No 49. is a

Answer for Question No 50. is b

Answer for Question No 51. is b

Answer for Question No 52. is a

Answer for Question No 53. is b

Answer for Question No 54. is b

Answer for Question No 55. is a

Answer for Question No 56. is d

Answer for Question No 57. is d

Answer for Question No 58. is a

Answer for Question No 59. is c

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 2. The procedure to organize items of a given collection into groups based on
some similar features called as -------------

A : Regression

B : Clustering

C : Decision Trees

D : Association

Q.no 3. ------------- is the fundamental library used for scientific computing.

A : Pandas

B : Numpy

C : Sympy

D : Scipy

Q.no 4. The -------- function is used to add a title to each axes instance in a figure.

A : set_title()

B : get_title()

C : set_label()

D : title()

Q.no 5. ---------- provides a range of supervised and unsupervised learning
algorithms via a consistent interface in Python.

A : Pandas

B : Numpy

C : Scikit-Learn

D : image

Q.no 6. The -------- function creates a 2-D array with diagonal values 1 and rest
values zeros.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 7. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 8. To import data from a CSV file into a DataFrame, the ---------- function is provided
by the pandas package.

A : read_csv()

B : read_file()

C : csv_read()

D : from_csv()

Q.no 9. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 10. Naïve Bayes is a classification technique based on ----------

A : Bayes Theorem

B : Pythagorean Theorem

C : Least square method

D : mean square method
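
Bayes' theorem, on which Naïve Bayes rests, turns a class prior and a likelihood into a posterior probability. The sketch below is pure Python, and every probability in it is a made-up value for illustration.

```python
# Hypothetical spam-filter figures (assumptions, not measured data).
p_spam = 0.2                 # prior: P(spam)
p_word_given_spam = 0.6      # likelihood: P(word | spam)
p_word_given_ham = 0.05      # likelihood: P(word | not spam)

# Total probability of seeing the word, summed over both classes.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)
```

With these figures the posterior works out to 0.12 / 0.16 = 0.75, so observing the word raises the spam probability from 20% to 75%.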

Q.no 11. ------------ means the part of the population chosen for participation in the study.

A : Population

B : Sample

C : Association

D : Correlation

Q.no 12. If the number of input features is 3, then the optimal hyperplane in a support
vector machine is a -------------

A : Single point

B : Line

C : 2-D Plane

D : Non linear line

Q.no 13. The ---------------- method of a DataFrame returns the first n rows.

A : head(n)

B : tail(n)

C : first(n)

D : start(n)

Q.no 14. ------------ uses a tree structure to specify sequences of decisions and
consequences.

A : Regression

B : Decision trees

C : KNN

D : SVM

Q.no 15. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 16. -------- library is built on the top of Numpy, SciPy and Matplotlib

A : Sympy

B : Scikit

C : Pandas

D : Numpy

Q.no 17. Which Python library is used for implementing machine learning
algorithms?

A : Scikit-Learn

B : Pandas

C : Matplotlib

D : Numpy

Q.no 18. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 19. Satellite image is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 20. Which of the following is not a raster image file format?

A : PNG

B : JPG

C : BMP

D : PDF

Q.no 21. Which of the following plots is not used for multidimensional
visualization?

A : Andrews Curves

B : Parallel Chart

C : Deviation Chart

D : Bar

Q.no 22. -------- is the measure of the likelihood that an event will occur in a
random experiment.

A : Probability

B : Correlation

C : Regression

D : Sample

A : Apriori

B : K-Nearest Neighbors

C : K-Means

D : Decision Trees

Q.no 24. If X and Y are independent of each other, then the correlation
coefficient is ---------

A:1

B : -1

C:0

D:2

Q.no 25. To rotate an image -------- function is used from scipy library.

A : rotation()

B : scipy.move()

C : scipy.ndimage.rotate()

D : scipy.flip()

Q.no 26. To set the x-axis label of a figure, the ----------- function is used.

A : set_title()

B : set_label()

C : set_xlabel()

D : get_xlabel()

Q.no 27. In the head()/tail() functions of a DataFrame, the default number of elements to
display is --------

A:3

B:5

C:1

D : 10

Q.no 28. Regression analysis -----------

A : Establishes a relationship between two variables

B : Establishes cause and effect

C : Measures growth

D : Measures demand for good

Q.no 29. ------------ is an indication of how frequently the itemset appears in the
dataset in association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These
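
Support, confidence and lift can all be counted directly from a list of transactions. This is a pure-Python sketch; the four baskets and the helper name support are invented for illustration.

```python
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

support_both = support({"bread", "milk"})          # how often the itemset appears
confidence = support_both / support({"bread"})     # how often bread -> milk holds
lift = confidence / support({"milk"})              # confidence relative to milk's base rate
print(support_both, confidence, lift)
```

Here the itemset {bread, milk} appears in 2 of 4 baskets (support 0.5), and the rule bread -> milk holds in 2 of the 3 baskets that contain bread.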

Q.no 30. In decision trees leaf node denotes a -----------------

A : class distribution

B : test on an attribute

C : outcome of the test

D : class labels

Q.no 31. ----------- analysis finds the reasons behind success or failure in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 32. In this type of algorithm, inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 33. Pandas provides the ----------- function as the entry point for all standard
database join operations when merging two DataFrame objects.

A : concat()

B : replace()

C : merge()

D : add()

Q.no 34. ------------ is a 2-D data structure defined in pandas in which data is arranged in
rows and columns.

A : Series

B : Dataframe

C : ndarray

D : list

Q.no 35. ------------ is an example of semi structured data

A : NoSQL data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 36. ------------ is the step performed by a data scientist after acquiring the data.

A : Data Cleansing

B : Data Integration

C : Data Replication

D : Data loading

Q.no 37. ------------ is a measure of the randomness in the information being
processed.

A : Entropy

B : Support

C : Confidence

D : lift

Q.no 38. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 39. ------- is the basic data structure of pandas; it can be thought of as an SQL table
or a spreadsheet data representation.

A : Dataframe

B : series

C : list

D : ndarray

Q.no 40. ------------- regression finds a relationship between one or more features
(independent variables) and a continuous variable (dependent variable).

A : Non-linear

B : Linear

C : Both of these

D : None of These

Q.no 41. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 42. ---------- is the machine learning algorithm used in cross-marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 43. In a DataFrame, the ---------- function is used to compute summary statistics
such as count, mean, standard deviation, min and max for each numerical column.

A : display()

B : head()

C : describe()

D : sort()

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

B : Train- 50%, Test - 50%

C : Train- 30%, Test - 70%

D : Train- 100%, Test - 00%

Q.no 46. --------------- is basically extracting a particular set of elements from an array.

A : Slicing

B : indexing

C : sorting

D : broadcasting

Q.no 47. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : lift
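
The measure described in this question can be computed for a list of class labels. This is a minimal pure-Python sketch of Shannon entropy in bits; the helper name and the labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the label frequencies."""
    total = len(labels)
    return sum((count / total) * math.log2(total / count)
               for count in Counter(labels).values())

mixed = entropy(["yes", "yes", "no", "no"])    # evenly mixed labels
pure = entropy(["yes", "yes", "yes", "yes"])   # a single class only
print(mixed, pure)
```

An evenly mixed two-class set has the maximum entropy of 1 bit, while a pure set has entropy 0; decision tree algorithms use this difference to score candidate splits.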

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 49. To reach the final point and make a prediction, decision trees must
be traversed from ----------

A : Top - to - bottom

B : Bottom - to - Top

C : Left - to - Right

D : Right - to - Left

Q.no 50. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 51. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff Recruitment

Q.no 52. To save a figure to a file, we can use the ------------ method of the Figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 53. Plot_number parameter from subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows

D : ncols

Q.no 54. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 55. The statement subplot(4,3,5) will divide the figure into a ------- grid and specify
that plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3x 4, 5

C : 3 x 5, 4

D : 5x 3, 4

Q.no 56. The strength (degree) of the correlation between a set of independent
variables X and a dependent variable Y is measured by-------------

A : Coefficient of Correlation

B : Coefficient of Determination

C : Standard error of estimate

D : Probability

Q.no 57. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 58. In matplotlib, -------------- is the container class for a figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 59. Which of the following machine learning algorithms is used for market
basket analysis, that is, to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 60. Determining the basic salary of an employee when his qualification is given is
a ----------- problem

A : Correlation

B : Regression

C : Association

D : Qualitative

Answer for Question No 1. is b

Answer for Question No 2. is b

Answer for Question No 3. is d

Answer for Question No 4. is a

Answer for Question No 5. is c

Answer for Question No 6. is c

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is a

Answer for Question No 10. is a

Answer for Question No 11. is b

Answer for Question No 12. is c

Answer for Question No 13. is a

Answer for Question No 14. is b

Answer for Question No 15. is a

Answer for Question No 16. is b

Answer for Question No 17. is a

Answer for Question No 18. is d

Answer for Question No 19. is b

Answer for Question No 20. is d

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is c

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is b

Answer for Question No 28. is a

Answer for Question No 29. is b

Answer for Question No 30. is c

Answer for Question No 31. is a

Answer for Question No 32. is a

Answer for Question No 33. is c

Answer for Question No 34. is b

Answer for Question No 35. is a

Answer for Question No 36. is a

Answer for Question No 37. is a

Answer for Question No 38. is b

Answer for Question No 39. is a

Answer for Question No 40. is b

Answer for Question No 41. is d

Answer for Question No 42. is b

Answer for Question No 43. is c

Answer for Question No 44. is b

Answer for Question No 45. is a

Answer for Question No 46. is a

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is d

Answer for Question No 52. is b

Answer for Question No 53. is a

Answer for Question No 54. is a

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is c

Answer for Question No 58. is d

Answer for Question No 59. is b

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. NumPy supports this function to find the trigonometric sine elementwise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()

D : numpy.rad2sin(x1)

Q.no 2. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 3. The ---------- function is used to get the positive square root of a numpy array
elementwise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.find(x1,2)

Q.no 4. -------------- data does not fit into a data model due to variations in content.

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 5. Which of the following is NOT supervised learning?

A : PCA

B : Decision Tree

C : Linear Regression

D : Naive Bayesian

Q.no 6. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 7. Which of the following functions is used to create an array of a specified
shape filled with random values?

A : numpy.random.rand()

B : rank

C : random.fill()

D : numpy.fillrandom()

Q.no 8. ----------------- is an example of human generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 9. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 10. The -------- function creates a 2-D array with all values 0 (zeros).

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 11. ------------- is the fundamental library used for scientific computing.

A : Pandas

B : Numpy

C : Sympy

D : Scipy

Q.no 12. The -------- function creates a 2-D array with diagonal values 1 and rest
values zeros.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 13. Pandas provides the ----------- method for label-based indexing.

A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 14. The ---------- attribute specifies the number of dimensions or axes of the
array.

A : ndarray.size

B : ndarray.dtype

C : ndarray.ndim

D : ndarray.axes
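
The attributes named in these options describe different properties of the same array. This is a small sketch, assuming NumPy is installed; the array is arbitrary.

```python
import numpy as np

grid = np.zeros((2, 3))

print(grid.ndim)    # number of dimensions (axes)
print(grid.shape)   # length along each axis
print(grid.size)    # total number of elements
print(grid.dtype)   # element type
```

For this 2 x 3 array, ndim is 2, shape is (2, 3) and size is 6. Note that ndarray has no attribute called axes.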

Q.no 15. In support vector machines, if there are 2 input features, then the decision
boundary or hyperplane is a ---------------.

A : 2-D plane

B : 3-D plane

C : Line

D : point

Q.no 16. The ------------- type of analytics describes what happened in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 17. ---- is a technique to learn from examples and experience, without being
explicitly programmed.

A : Machine Learning

B : Software Testing

C : Computer Science

D : Data mining

Q.no 18. ------------ means the part of the population chosen for participation in the study.

A : Population

B : Sample

C : Association

D : Correlation

Q.no 19. The ----------- algorithm is based on the fact that it uses prior
knowledge to find frequent item sets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 20. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 22. What is the correct syntax to generate integers from 10 to 30, both inclusive?

A : x=numpy.arange(10,30)

B : x=numpy.array(10,30)

C : x=numpy.arange(10,31)

D : x=arange(10,31)
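
The stop value of numpy.arange is exclusive, which is what separates these options. This is a short sketch, assuming NumPy is installed.

```python
import numpy as np

up_to_29 = np.arange(10, 30)   # stop value 30 is excluded
up_to_30 = np.arange(10, 31)   # stop value 31 is excluded, so 30 is the last element

print(up_to_29[-1], up_to_30[-1])
```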

Q.no 23. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These

Q.no 24. The ------------- function is used to save an array as an image file.

A : matplotlib.pyplot.image()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 25. If X and Y are independent of each other, then the correlation
coefficient is ---------

A:1

B : -1

C:0

D:2

Q.no 26. JSON file data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 27. What does the following function call do? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to figure

D : Add text to figure

Q.no 28. Regression analysis -----------

A : Establishes a relationship between two variables

B : Establishes cause and effect

C : Measures growth

D : Measures demand for good

Q.no 29. In this type of algorithm, inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 30. ----------- analysis finds the reasons behind success or failure in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 31. -------------- models search the data space for areas of varying density of data
points.

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 32. The ---------- function is used to get the elementwise remainder of division of two arrays.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 33. If a=np.array([1,2,3,4,5,6,7,8,9,10]) then a[2:5:1] will produce the output
------------------

A : 3, 4, 5

B : 3,4,5,6

C : 2,3,4,5

D : 1,2,3,4,5
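
Slice notation start:stop:step selects elements from index start up to, but not including, index stop. This is a brief sketch, assuming NumPy is installed.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Indices 2, 3 and 4 are selected; index 5 is excluded.
middle = a[2:5:1]
print(middle)
```

Because indexing is zero-based, index 2 holds the value 3, so the slice yields 3, 4 and 5.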

Q.no 34. The slope of the regression line of Y on X is also called the -----------

A : Correlation coefficient

B : Regression coefficient

C : Association coefficient

D : Probability

Q.no 35. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 36. In the head()/tail() functions of a DataFrame, the default number of elements to
display is --------

A : 3

B : 5

C : 1

D : 10

Q.no 37. A perfect negative correlation is signified by -------------

A:1

B : -1

C:0

D:2

Q.no 38. ---------- is an unsupervised technique that aims to divide a multivariate dataset
into clusters or groups.

A : KNN

B : Support Vector Machines

C : Regression

D : Cluster analysis

Q.no 39. In which of the following types of clustering algorithm is the notion of
similarity derived from the closeness of a data point to the centroid of the
clusters?

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 40. ------------ is an example of semi structured data

A : XML data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 41. Plot_number parameter from subplot() function can range from 1 to ------

A : nrows*ncols

B : max
C : nrows

D : ncols

Q.no 42. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 43. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 44. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 45. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 46. Determining the basic salary of an employee when his qualification is given is
a ----------- problem

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 47. In a data science project, the data acquisition step involves ----------------

A : Acquiring data from various sources.

B : Selecting dataset

C : Data preprocessing

D : Data modeling

Q.no 48. --------- is a technique that duplicates the smaller array so that its
dimensionality and size match those of the larger array.

A : Multiplication

B : Broadcasting

C : Addition

D : Flatten

Q.no 49. Which numpy function is used to return the truncated value of the
input elementwise?

A : round()

B : trunc()

C : del()

D : remove_decimal()

Q.no 50. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 51. Which of the following machine learning algorithms is used for market
basket analysis, that is, to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 52. ---------- is the machine learning algorithm used in cross-marketing to work with
other businesses that complement your own business, but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 53. In regression the independent variable is also called as ------------

A : Regressor

B : Continuous

C : Regressand

D : Estimated

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 55. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : lift

Q.no 56. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 57. The --------- argument of the merge function, used when merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how
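
The effect of the merge arguments is easiest to see on two small tables. This is a minimal sketch, assuming pandas is installed; the column names and values are invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", "C"]})
right = pd.DataFrame({"id": [2, 3, 4], "marks": [70, 80, 90]})

# on names the join column; how decides which keys survive the join.
inner = pd.merge(left, right, on="id", how="inner")   # keys present in both frames
outer = pd.merge(left, right, on="id", how="outer")   # keys present in either frame

print(inner)
print(outer)
```

With how="inner" only ids 2 and 3 remain; with how="outer" all four ids appear and missing cells are filled with NaN.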

Q.no 58. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 59. To save a figure to a file, we can use the ------------ method of the Figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 60. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Answer for Question No 1. is a

Answer for Question No 2. is a

Answer for Question No 3. is a

Answer for Question No 4. is b

Answer for Question No 5. is a

Answer for Question No 6. is a

Answer for Question No 7. is a

Answer for Question No 8. is a

Answer for Question No 9. is a

Answer for Question No 10. is b

Answer for Question No 11. is d

Answer for Question No 12. is c

Answer for Question No 13. is b

Answer for Question No 14. is c

Answer for Question No 15. is c

Answer for Question No 16. is a

Answer for Question No 17. is a

Answer for Question No 18. is b

Answer for Question No 19. is d

Answer for Question No 20. is d

Answer for Question No 21. is a

Answer for Question No 22. is c

Answer for Question No 23. is a

Answer for Question No 24. is d

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is a

Answer for Question No 28. is a

Answer for Question No 29. is a

Answer for Question No 30. is a

Answer for Question No 31. is d

Answer for Question No 32. is b

Answer for Question No 33. is a

Answer for Question No 34. is b

Answer for Question No 35. is b

Answer for Question No 36. is b

Answer for Question No 37. is b

Answer for Question No 38. is d

Answer for Question No 39. is b

Answer for Question No 40. is a

Answer for Question No 41. is a

Answer for Question No 42. is a

Answer for Question No 43. is d

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is b

Answer for Question No 50. is c

Answer for Question No 51. is b

Answer for Question No 52. is b

Answer for Question No 53. is a

Answer for Question No 54. is b

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is d

Answer for Question No 58. is c

Answer for Question No 59. is b

Answer for Question No 60. is d

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. Unsupervised learning makes sense of ------------- data without having any
predefined dataset for its training.

A : unlabeled

B : labeled

C : semi-labeled

D : Empty dataset

Q.no 2. For multidimensional visualization ---------------- are used.

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots
Q.no 3. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 4. The ---------------- function multiplies two matrices in NumPy.

A : prod()

B : mult()

C : dot()

D : *
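
Note: as an illustration of the distinction this question tests, the sketch below uses plain Python lists (not NumPy itself) to show what a matrix product computes compared with an element-wise product. The helper names `matmul` and `elementwise` are hypothetical and exist only for this example. For 2-D NumPy arrays, `numpy.dot(a, b)` gives the matrix product, while `a * b` on ndarrays multiplies element by element.

```python
def matmul(a, b):
    """Matrix product of two 2-D lists: rows of a combined with columns of b."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def elementwise(a, b):
    """Element-by-element product of two equally shaped 2-D lists."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
product = matmul(a, b)        # [[19, 22], [43, 50]]
hadamard = elementwise(a, b)  # [[5, 12], [21, 32]]
```

The two results differ, which is why the element-wise operator is not a substitute for a matrix product on arrays.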

Q.no 5. If the number of input features is 3, then the optimal hyperplane in a support
vector machine is -------------

A : Single point

B : Line

C : 2-D Plane

D : Non linear line

Q.no 6. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 7. ------ analytics answers questions like "How can we make it happen?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability
Q.no 8. Pandas provides the ----------- method for label-based indexing.

A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 9. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability
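
Note: a minimal sketch of the kind of analysis this question describes, fitting a straight line y = slope * x + intercept to one independent and one dependent variable by ordinary least squares. The function name `fit_simple_regression` and the sample data are assumptions made for illustration only.

```python
def fit_simple_regression(xs, ys):
    """Least-squares slope and intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # generated from y = 2x + 1
slope, intercept = fit_simple_regression(xs, ys)  # slope 2.0, intercept 1.0
```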

Q.no 10. -------------------- is a general-purpose array-processing package that provides a
high-performance multidimensional array object and tools for working with
these arrays.

A : NumPy

B : SciPy

C : sklearn

D : None of these

Q.no 11. The leaf nodes in a decision tree return the ---------

A : decision condition

B : class labels

C : decision on variables

D : test score

Q.no 12. A satellite image is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 13. The -------- function creates a 2-D array with all values 0 (zeros).

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 14. The ---------- function is used to get the positive square root of a NumPy array
element-wise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.find(x1,2)

Q.no 15. Pin code of a city is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 16. ------------- is a fundamental library used for scientific computing.

A : Pandas

B : Numpy

C : Sympy

D : Scipy

Q.no 17. Find odd one out from the following :

A : KNN

B : Naïve Bayes

C : Decision Trees
D : Cluster analysis

Q.no 18. ----------- is a supervised machine learning algorithm that outputs an optimal
hyperplane for given labeled training data.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 19. To import data from a CSV file into a DataFrame, the ---------- function is provided
by the pandas package.

A : read_csv()

B : read_file()

C : csv_read()

D : from_csv()

Q.no 20. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 21. JSON file data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 22. -------- is the most important language for data science.

A : Java

B : Ruby

C : R

D : None of these

Q.no 23. ------------ is a 2-D data structure defined in pandas in which data is arranged in
rows and columns.

A : Series

B : Dataframe

C : ndarray

D : list

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 25. Which of the following is not used for 2-D Visualisation?

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 26. The -------- of a numpy array is a tuple of integers giving the size of the
array along each dimension.

A : axes

B : rank

C : shape

D : size

Q.no 27. Pandas provides the ----------- method for purely integer-based
indexing.
A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 28. --------- in a decision tree measures how much information a feature gives us
about the class.

A : Information Gain

B : Posterior probability

C : Prior probability

D : probability
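
Note: a small sketch of how the measure in this question can be computed. It assumes Shannon entropy in bits and a categorical feature; the function names `entropy` and `information_gain` and the toy labels are illustrative assumptions, not taken from any particular library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["yes", "yes", "no", "no"]
separating = ["a", "a", "b", "b"]   # splits the classes perfectly
uninformative = ["a", "b", "a", "b"]  # tells us nothing about the class

gain_separating = information_gain(separating, labels)        # 1.0 bit
gain_uninformative = information_gain(uninformative, labels)  # 0.0 bits
```

A feature with a higher gain reduces the uncertainty about the class more, which is why it is preferred as a split.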

Q.no 29. The process by which we estimate value of dependent variable on the
basis of one or more independent variables is called as -----------

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 30. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 31. A ------------ is a supervised machine learning algorithm which relies on the
assumption of feature independence to classify input data.

A : Clustering

B : Regression

C : Naïve Bayes
D : Apriori

Q.no 32. --------------------- is a form of supervised learning which is used by
mail service providers like Gmail, Yahoo, etc. to classify a new mail as spam or
not spam.

A : Classification

B : Regression

C : Clustering

D : Naïve Bayes

Q.no 33. The objective of the --------- algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 34. The --------- function from the matplotlib.pyplot library plots a bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 35. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : software tester

Q.no 36. In matplotlib, the ------------- function groups smaller axes that can exist
together within a single figure.

A : subplot()

B : divide_figure()

C : add_fig()

D : group_fig()

Q.no 37. The ------------- function is used to save an array as an image file.

A : matplotlib.pyplot.image()

B : matplotlib.pyplot.imread()

C : matplotlib.pyplot.imwrite()

D : matplotlib.pyplot.imsave()

Q.no 38. --------- is a measure of the randomness in the information being
processed.

A : Entropy

B : Support

C : Confidence

D : lift

Q.no 39. The ---------- function is used to add two NumPy arrays element-wise.

A : numpy.add(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.addition(x1,x2)

Q.no 40. In this type of clustering, each data point either belongs to a cluster
completely or not.

A : Hard clustering

B : Soft clustering

C : Medium clustering

D : Simple clustering

Q.no 41. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3 x 4, 5

C : 3 x 5, 4

D : 5 x 3, 4
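
Note: the numbering convention behind `subplot(nrows, ncols, index)` can be sketched in plain Python. Subplot indices start at 1 in the top-left cell and increase row by row. The helper `subplot_position` is a hypothetical name used only here; it is not part of matplotlib.

```python
def subplot_position(nrows, ncols, index):
    """Return the 0-based (row, col) cell for a 1-based, row-major subplot index."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    return divmod(index - 1, ncols)


# subplot(4, 3, 5): a 4 x 3 grid, fifth cell counted row by row.
cell_5 = subplot_position(4, 3, 5)      # (1, 1): second row, second column

# In a 2 x 3 grid, index 3 is the last cell of the first row (top right).
top_right = subplot_position(2, 3, 3)   # (0, 2)
```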

Q.no 42. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 43. Which function from NumPy is used to return the truncated value of the
input element-wise?

A : round()

B : trunc()

C : del()

D : remove_decimal()

Q.no 44. Which function returns an ndarray object that contains numbers that
are evenly spaced on a log scale?

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()
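
Note: "evenly spaced on a log scale" means the exponents are evenly spaced, not the values themselves. The plain-Python sketch below mimics that behaviour for illustration; the local function `logspace` is an assumption written for this note and only approximates what `numpy.logspace(start, stop, num, base=10.0)` returns (the real function returns an ndarray).

```python
def logspace(start, stop, num, base=10.0):
    """Return num values base**e, with exponents e evenly spaced from start to stop inclusive."""
    if num == 1:
        return [base ** start]
    step = (stop - start) / (num - 1)
    return [base ** (start + i * step) for i in range(num)]


# Exponents 0, 1, 2, 3 give 1, 10, 100, 1000 (up to floating-point rounding).
values = logspace(0, 3, 4)

# Successive ratios are constant, which is what "log-spaced" means.
ratios = [values[i + 1] / values[i] for i in range(len(values) - 1)]
```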

Q.no 45. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)
Q.no 46. --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()

Q.no 47. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 48. The --------- argument of the merge function, used while merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how

Q.no 49. --------- function performs the custom operations for the entire dataframe.

A : function()

B : subroutine()

C : routine()

D : pipe()

Q.no 50. --------------- is basically extracting particular set of elements from an array.

A : Slicing

B : indexing

C : sorting
D : broadcasting

Q.no 51. To reach the final point and make a prediction, a decision tree must
be traversed from ----------

A : Top-to-bottom

B : Bottom-to-top

C : Left-to-right

D : Right-to-left

Q.no 52. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 53. Which of the following machine learning algorithms is used for market
basket analysis, that is, to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 54. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 55. In matplotlib, -------------- is the container class for a figure instance.

A : Axes
B : Canvas

C : Figure

D : FigureCanvas

Q.no 56. Which of the following algorithms is used in economics, finance, biology,
etc. to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 57. In regression the dependent variable is also called as ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 58. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 60. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()
Answer for Question No 1. is a

Answer for Question No 2. is c

Answer for Question No 3. is a

Answer for Question No 4. is c

Answer for Question No 5. is c

Answer for Question No 6. is a

Answer for Question No 7. is b

Answer for Question No 8. is b

Answer for Question No 9. is a

Answer for Question No 10. is a

Answer for Question No 11. is b

Answer for Question No 12. is b

Answer for Question No 13. is b

Answer for Question No 14. is a

Answer for Question No 15. is a

Answer for Question No 16. is d

Answer for Question No 17. is d

Answer for Question No 18. is b

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is c

Answer for Question No 22. is c

Answer for Question No 23. is b

Answer for Question No 24. is a

Answer for Question No 25. is c

Answer for Question No 26. is c

Answer for Question No 27. is a

Answer for Question No 28. is a

Answer for Question No 29. is b

Answer for Question No 30. is d

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is b

Answer for Question No 34. is c

Answer for Question No 35. is d

Answer for Question No 36. is a

Answer for Question No 37. is d

Answer for Question No 38. is a

Answer for Question No 39. is a

Answer for Question No 40. is a

Answer for Question No 41. is a

Answer for Question No 42. is a

Answer for Question No 43. is b

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is c

Answer for Question No 47. is b

Answer for Question No 48. is d

Answer for Question No 49. is d

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is d

Answer for Question No 53. is b

Answer for Question No 54. is d

Answer for Question No 55. is d

Answer for Question No 56. is a

Answer for Question No 57. is c

Answer for Question No 58. is a

Answer for Question No 59. is b

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. Naïve Bayes is a classification technique based on ----------

A : Bayes Theorem

B : Pythagorean Theorem

C : Least square method

D : mean square method

Q.no 2. ------------- function is used to plot a histogram using matplotlib library

A : hist()

B : bar()

C : pie()

D : scatter()
Q.no 3. ------------ rule mining is a technique to identify underlying relations
between different items.

A : Classification

B : Regression

C : Clustering

D : Association

Q.no 4. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 5. To import data from an Excel file into a DataFrame, the ---------- function is
provided by the pandas package.

A : read_csv()

B : read_file()

C : read()

D : read_excel()

Q.no 6. In a NumPy array, indices always start from --------

A : 1

B : -1

C : 0

D : 2

Q.no 7. Email data is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data
D : Scattered

Q.no 8. The ---------- function is used to get the positive square root of a NumPy array
element-wise.

A : numpy.sqrt(x1)

B : numpy.mod(x1)

C : numpy.square(x1)

D : numpy.find(x1,2)

Q.no 9. In --------- learning, the training is controlled by an external supervisor or
teacher.

A : Unsupervised

B : Supervised

C : Semi-supervised

D : group

Q.no 10. For multidimensional visualization ---------------- are used.

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 11. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 12. To import data from a CSV file into a DataFrame, the ---------- function is provided
by the pandas package.

A : read_csv()

B : read_file()

C : csv_read()

D : from_csv()

Q.no 13. The -------- function creates a 2-D array with all values 1.

A : numpy.ones()

B : numpy.zeros()

C : numpy.eye()

D : numpy.empty()

Q.no 14. The K-nearest neighbors algorithm is based on -------------- learning

A : Unsupervised

B : Supervised

C : Association

D : correlation

Q.no 15. In support vector machines, if there are 2 input features, then the decision
boundary or hyperplane is a ---------------.

A : 2-D plane

B : 3-D plane

C : Line

D : point

Q.no 16. ---------------- submodule of scipy is dedicated to image processing.

A : ndarray

B : spatial

C : ndimage

D : special

Q.no 17. ------------ uses a tree structure to specify sequences of decisions and
consequences.
A : Regression

B : Decision trees

C : KNN

D : SVM

Q.no 18. NumPy supports this function to find the trigonometric sine element-wise.

A : numpy.sin()

B : numpy.cosine()

C : numpy.tangent()

D : numpy.rad2sin(x1)

Q.no 19. The procedure of organizing items of a given collection into groups based
on some similar features is called -------------

A : Regression

B : Clustering

C : Decision Trees

D : Association

Q.no 20. matplotlib.pyplot.imread() function is used to ---------------

A : save image

B : read image

C : copy image

D : show image

Q.no 21. -------------- models search the data space for areas of varied density of data
points in the data space.

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models
Q.no 22. Pandas provides the ----------- method for purely integer-based
indexing.

A : iloc()

B : loc()

C : ix()

D : xloc()

Q.no 23. To rotate an image -------- function is used from scipy library.

A : rotation()

B : scipy.move()

C : scipy.ndimage.rotate()

D : scipy.flip()

Q.no 24. ------------- is unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 25. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : software tester

Q.no 26. The number of iterations in apriori ---------------

A : increases with the size of the data

B : decreases with the increase in size of the data

C : increases with the size of the maximum frequent set

D : decreases with increase in size of the maximum frequent set
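
Note: to see why the number of Apriori iterations is tied to the size of the largest frequent itemset, the sketch below runs a simplified level-wise search and counts one pass over the data per candidate size. It is an illustrative sketch under assumptions (an absolute `min_count` threshold, transactions given as Python sets, the hypothetical function name `apriori_passes`) rather than an optimized or reference implementation.

```python
from itertools import combinations


def apriori_passes(transactions, min_count):
    """Return (frequent itemsets grouped by size, number of passes over the data)."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent_by_size = {}
    passes = 0
    k = 1
    while candidates:
        passes += 1  # one scan of all transactions per candidate size k
        frequent = [
            c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_count
        ]
        if not frequent:
            break
        frequent_by_size[k] = frequent
        k += 1
        # Join step: combine frequent sets into candidates of the next size.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: keep a candidate only if all its smaller subsets are frequent.
        known = set(frequent)
        candidates = [
            c for c in joined
            if all(frozenset(s) in known for s in combinations(c, k - 1))
        ]
    return frequent_by_size, passes


short = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"}]
longer = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}]

frequent_short, passes_short = apriori_passes(short, min_count=2)     # largest set has 1 item
frequent_longer, passes_longer = apriori_passes(longer, min_count=2)  # largest set has 3 items
```

With these toy inputs the first dataset needs two passes and the second needs three, even though the first has more transactions.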

Q.no 27. ------------- regression finds a relationship between one or more features
(independent variables) and a continuous variable (dependent variable).

A : Non-linear

B : Linear

C : Both of these

D : None of These

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 29. Which of the following is not used for 2-D Visualisation?

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 30. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))
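
Note: a short worked sketch of the support measure from this question, together with the related confidence measure used in association rule mining. The function names `support` and `confidence` and the toy shopping baskets are assumptions made for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

support_bread = support(baskets, {"bread"})                  # 3 of 5 baskets
conf_bread_milk = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets
```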

Q.no 31. In decision trees leaf node denotes a -----------------

A : class distribution

B : test on an attribute

C : outcome of the test

D : class labels

Q.no 32. Which of the following is used as attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 33. A ------------ is a supervised machine learning algorithm which relies on the
assumption of feature independence to classify input data.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 34. The ---------- function is used to get the element-wise remainder of array division.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 35. In this type of algorithms inputs are provided but not the desired output.

A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 36. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support
C : Lift

D : None of These

Q.no 37. The --------- function from the matplotlib.pyplot library plots a bar graph for given
values of x and y.

A : plot()

B : draw()

C : bar()

D : linedraw()

Q.no 38. To set the x-axis label of a figure, the ----------- function is used.

A : set_title()

B : set_lable()

C : set_xlabel()

D : get_xlabel()

Q.no 39. What is the use of the following function call? plt.xlabel("Total Marks")

A : Gives label to X-Axis

B : Gives label to Y-Axis

C : Gives title to figure

D : Adds text to figure

Q.no 40. In SciPy ---------- submodule is dedicated to image processing.

A : ndimage

B : ndarray

C : signal

D : io

Q.no 41. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate itemsets efficiently.

A : Decision tree
B : Hash tree

C : Red-Black Tree

D : AVL Tree

Q.no 42. Which of the following tasks is not performed by a data scientist?

A : Define the question

B : Create reproducible code

C : Challenge results

D : Staff recruitment

Q.no 43. To reach the final point and make a prediction, a decision tree must
be traversed from ----------

A : Top-to-bottom

B : Bottom-to-top

C : Left-to-right

D : Right-to-left

Q.no 44. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)

Q.no 45. In regression the independent variable is also called as ------------

A : Regressor

B : Continuous

C : Regressand

D : Estimated

Q.no 46. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.
A : extract()

B : transform()

C : infer()

D : classify()

Q.no 47. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 48. When an increase or decrease in one variable has no impact on the
other variable, it is ------------

A : Perfect correlation

B : No Correlation

C : Positive Correlation

D : Negative Correlation

Q.no 49. For testing the accuracy of a machine learning algorithm, the whole dataset
should be divided into training and testing datasets. Which of the following is
a good proportion for train-test splitting?

A : Train - 70%, Test - 30%

B : Train - 50%, Test - 50%

C : Train - 30%, Test - 70%

D : Train - 100%, Test - 0%
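
Note: a minimal sketch of a 70/30 split using only the standard library. The helper name `split_train_test`, the fixed seed, and the toy data are assumptions for illustration; in practice scikit-learn's `sklearn.model_selection.train_test_split` is commonly used for this step.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(10))
train, test = split_train_test(data)  # 7 training rows, 3 test rows
```

Holding out part of the data matters because accuracy measured on rows the model was trained on tends to be over-optimistic.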

Q.no 50. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN
D : None of These

Q.no 51. The plot_number parameter of the subplot() function can range from 1 to ------

A : nrows*ncols

B : max

C : nrows

D : ncols

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 53. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 54. In this type of clustering, instead of putting each data point into a separate
cluster, a probability or likelihood of that data point being in those clusters is
assigned.

A : Hard clustering

B : Soft clustering

C : Medium clustering

D : Simple clustering
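
Note: the difference between assigning a point to exactly one cluster and assigning it a degree of membership in every cluster can be sketched as follows. The weighting used in `soft_assignment` (a normalized exponential of the negative squared distance) is one possible choice made purely for illustration, and both function names are hypothetical.

```python
from math import dist, exp


def hard_assignment(point, centroids):
    """Index of the single nearest centroid: the point belongs to it completely."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def soft_assignment(point, centroids):
    """Membership weights that sum to 1 and are larger for nearer centroids."""
    scores = [exp(-dist(point, c) ** 2) for c in centroids]
    total = sum(scores)
    return [s / total for s in scores]


centroids = [(0.0, 0.0), (4.0, 0.0)]
point = (1.0, 0.0)

label = hard_assignment(point, centroids)    # 0: nearest centroid only
weights = soft_assignment(point, centroids)  # one weight per cluster
```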

Q.no 55. In regression the dependent variable is also called as ------------

A : Regression
B : Continuous

C : Regressand

D : Independent

Q.no 56. The --------- argument of the merge function, used while merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how

Q.no 57. While plotting using matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 59. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()
D : subplot()

Q.no 60. To save a figure into a file we can use ------------ method in the figure class
of matplotlib.pyplot.

A : save()

B : savefig()

C : Figure()

D : save_image()
Answer for Question No 1. is a

Answer for Question No 2. is a

Answer for Question No 3. is d

Answer for Question No 4. is a

Answer for Question No 5. is d

Answer for Question No 6. is c

Answer for Question No 7. is b

Answer for Question No 8. is a

Answer for Question No 9. is b

Answer for Question No 10. is c

Answer for Question No 11. is d

Answer for Question No 12. is a

Answer for Question No 13. is a

Answer for Question No 14. is b

Answer for Question No 15. is c

Answer for Question No 16. is c

Answer for Question No 17. is b

Answer for Question No 18. is a

Answer for Question No 19. is b

Answer for Question No 20. is b

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is c

Answer for Question No 24. is d

Answer for Question No 25. is d

Answer for Question No 26. is c

Answer for Question No 27. is b

Answer for Question No 28. is a

Answer for Question No 29. is c

Answer for Question No 30. is a

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is c

Answer for Question No 34. is b

Answer for Question No 35. is a

Answer for Question No 36. is a

Answer for Question No 37. is c

Answer for Question No 38. is c

Answer for Question No 39. is a

Answer for Question No 40. is a

Answer for Question No 41. is b

Answer for Question No 42. is d

Answer for Question No 43. is a

Answer for Question No 44. is a

Answer for Question No 45. is a

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is a

Answer for Question No 50. is a

Answer for Question No 51. is a

Answer for Question No 52. is b

Answer for Question No 53. is c

Answer for Question No 54. is b

Answer for Question No 55. is c

Answer for Question No 56. is d

Answer for Question No 57. is a

Answer for Question No 58. is b

Answer for Question No 59. is d

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50

Q.no 1. The value of the correlation coefficient lies between ----- and -----

A : -1 and +1

B : -1 and 0

C : 0 and 1

D : 0 and infinity
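
Note: the range of the correlation coefficient can be checked numerically. The sketch below assumes the Pearson coefficient and uses a hypothetical helper `pearson` with made-up sample data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [1, 2, 3, 4, 5]
r_increasing = pearson(xs, [2, 4, 6, 8, 10])   # perfect positive relationship
r_decreasing = pearson(xs, [10, 8, 6, 4, 2])   # perfect negative relationship
r_partial = pearson(xs, [2, 1, 4, 3, 5])       # 0.8, a strong but imperfect relationship
```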

Q.no 2. The ------------- type of analytics describes what happened in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 3. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 4. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 5. The ------------- function reads an image from a file as an array.

A : imsave()

B : imread()

C : read()

D : None of these

Q.no 6. Find odd one out from the following :

A : KNN

B : Naïve Bayes

C : Decision Trees

D : Cluster analysis

Q.no 7. The ----------- algorithm is based on the fact that the algorithm uses prior
knowledge to find frequent itemsets.

A : Clustering

B : Regression

C : Naïve Bayes

D : Apriori

Q.no 8. Pin code of a city is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

Q.no 9. matplotlib.pyplot.imread() function is used to ---------------

A : save image

B : read image

C : copy image

D : show image

Q.no 10. Choose correct option for machine generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Sensor data

Q.no 11. Which function is used to give title for the axes.

A : plt.title()

B : plt.xlabel()

C : plt.ylabel()

D : plt.xscale()

Q.no 12. Which of the following is a measure used in decision trees for selecting
the splitting criterion that partitions the data in the best possible manner?

A : Information Gain

B : Probability

C : Regression

D : Association

Q.no 13. ------------ means the part of the population chosen for participation in the study.

A : Population

B : Sample

C : Association

D : Correlation

Q.no 14. ----------------- is an example of human generated unstructured data.

A : YouTube data

B : Satellite data

C : Sensor data

D : Seismic imagery data

Q.no 15. ------------- function is used to save image into an ndarray.

A : imsave()

B : imread()

C : save()

D : isave()

Q.no 16. A ------------ chart is a circular plot divided into slices to show numerical
proportion.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 17. ------- answers the question "What will happen in future?"

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 18. The ---------------- method of a DataFrame returns the first n rows of the DataFrame.

A : head(n)

B : tail(n)

C : first(n)

D : start(n)

Q.no 19. ----------- refers to the graphical representation of information and data.

A : Data Visualization

B : Data mining

C : Data warehousing

D : Data Structures

Q.no 20. -------------------- is a general-purpose array-processing package that provides a
high-performance multidimensional array object and tools for working with
these arrays.

A : NumPy

B : SciPy

C : sklearn

D : None of these

Q.no 21. -------- uses a tree structure to specify a sequence of decisions and
consequences.

A : KNN

B : Naïve Bayes

C : Regression

D : Decision Tree

Q.no 22. Which statement will create a 5 x 5 array filled with all values 1?

A : x=numpy.ones((5,5))

B : x=numpy.ones(5)

C : x=numpy.zeros((5,5))

D : x=numpy.eye((5,5))
Q.no 23. In matplotlib library ------------- module supports basic image loading,
rescaling and display operations.

A : picture

B : image

C : pyplot

D : sympy

Q.no 24. The ---------- function is used to get the element-wise remainder of array division.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 25. In ------------ the x-axes are grouped into bins and each bin will be treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 26. -------- is the most important language for data science.

A : Java

B : Ruby

C : R

D : None of these

A : Apriori

B : K-Nearest Neighbors
C : K-Means

D : Decision Trees

Q.no 28. The ------------------ module of matplotlib is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 29. In which of the following types of clustering algorithm is the notion of
similarity derived from the closeness of a data point to the
centroid of the clusters?

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 30. --------------------- is a form of supervised learning which is used by
mail service providers like Gmail, Yahoo, etc. to classify a new mail as spam or
not spam.

A : Classification

B : Regression

C : Clustering

D : Naïve Bayes

Q.no 31. The number of iterations in apriori ---------------

A : increases with the size of the data

B : decreases with the increase in size of the data

C : increases with the size of the maximum frequent set

D : decreases with increase in size of the maximum frequent set

Q.no 32. In this type of algorithms inputs are provided but not the desired output.
A : Cluster analysis

B : Support Vector Machines

C : Decision trees

D : Naïve Bayes

Q.no 33. The objective of the --------- algorithm is to find a hyperplane in an N-dimensional
space that distinctly classifies the data points.

A : KNN

B : Support Vector Machines

C : Regression

D : Decision Tree

Q.no 34. Which of the following is used as attribute selection measure in decision
tree algorithms?

A : Information Gain

B : Posterior probability

C : Prior probability

D : Support

Q.no 35. ----------- analysis finds the reasons behind success or failure in the past.

A : Descriptive

B : Prescriptive

C : Predictive

D : Probability

Q.no 36. A ----------------- graph is a circular plot, divided into slices to show numerical
proportions.

A : Bar

B : Scatter

C : pie

D : line

Q.no 37. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))
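
As a worked illustration of the support ratio in this question, the sketch below counts the transactions that contain an itemset and divides by the total number of transactions. The basket data and the function name `support` are made up for illustration and are not part of the question paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    containing = sum(1 for t in transactions if wanted <= set(t))
    return containing / len(transactions)


# Hypothetical market-basket data (four transactions).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

support_bread = support(baskets, {"bread"})               # 3 of 4 baskets
support_bread_milk = support(baskets, {"bread", "milk"})  # 2 of 4 baskets
print(support_bread, support_bread_milk)
```

Larger itemsets can never have higher support than their subsets, which is the property Apriori relies on to prune candidates.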

Q.no 38. ----------- is not one of the key data science skills.

A : Statistics

B : Machine Learning

C : Data Visualization

D : software tester

Q.no 39. ------------ is an indication of how frequently the itemset appears in the
dataset in association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These

Q.no 40. When data are collected in a statistical study for only a portion or subset
of all elements of interest we are using

A : Sample

B : Parameter

C : Population

D : Probability

Q.no 41. In a data science project, the data acquisition step involves ----------------

A : Acquiring data from various sources.

B : Selecting dataset

C : Data preprocessing
D : Data modeling

Q.no 42. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 43. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 45. Which function returns an ndarray object that contains numbers that
are evenly spaced on a log scale?

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()

Q.no 46. To reach the final point and make a prediction, a decision tree must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

Q.no 47. -------- is an unsupervised algorithm used for frequent itemset mining.

A : Apriori

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 48. Which numpy function is used to return the truncated value of the
input, element-wise?

A : round()

B : trunc()

C : del()

D : remove_decimal()
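
To make the difference between truncation and rounding concrete, the sketch below applies the standard library's `math.trunc` to each element of a list. This is a plain-Python stand-in chosen for illustration, not the numpy call the question asks about, and the sample values are made up.

```python
import math

# Hypothetical sample values, including negatives and a .5 tie.
values = [-1.7, -0.2, 0.2, 1.5, 2.9]

# Truncation drops the fractional part, moving each value toward zero.
truncated = [math.trunc(v) for v in values]

# Rounding goes to the nearest integer (Python rounds ties to even).
rounded = [round(v) for v in values]

print(truncated)  # [-1, 0, 0, 1, 2]
print(rounded)    # [-2, 0, 0, 2, 3]
```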

Q.no 49. The strength (degree) of the correlation between a set of independent
variables X and a dependent variable Y is measured by-------------

A : Coefficient of Correlation

B : Coefficient of Determination

C : Standard error of estimate

D : Probability

Q.no 50. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 51. Which of the following statements will create an axes at the top right
corner of the current figure?

A : subplot(2,3,3)

B : subplot(2,3,2)

C : subplot(2,3,4)

D : subplot(2,3,5)
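
pyplot numbers subplot positions from 1, left to right and then top to bottom. The sketch below converts such an index into a zero-based (row, column) pair with plain arithmetic; the helper name `subplot_cell` is hypothetical and does not call matplotlib.

```python
def subplot_cell(nrows, ncols, index):
    """Return the zero-based (row, column) of a 1-based, row-major subplot index."""
    if not 1 <= index <= nrows * ncols:
        raise ValueError("index must be between 1 and nrows * ncols")
    return divmod(index - 1, ncols)


# In a 2 x 3 grid, index 3 is the last column of the first row (top right).
top_right = subplot_cell(2, 3, 3)
# Index 4 wraps to the first column of the second row.
second_row_start = subplot_cell(2, 3, 4)
print(top_right, second_row_start)  # (0, 2) (1, 0)
```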

Q.no 52. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : lift
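
As a small illustration of the measure this question describes, the sketch below computes Shannon entropy in bits from a list of class probabilities. The function name and the example distributions are assumptions made for illustration.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# A 50/50 split is maximally uncertain for two classes.
even_split = entropy([0.5, 0.5])
# A skewed split (5 of 8 versus 3 of 8) is slightly less uncertain.
skewed_split = entropy([5 / 8, 3 / 8])
print(even_split, round(skewed_split, 3))
```

Decision tree algorithms use the drop in this value after a split (information gain) to choose the attribute to test.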

Q.no 53. The --------- function performs custom operations on the entire DataFrame.

A : function()

B : subroutine()

C : routine()

D : pipe()

Q.no 54. The --------- argument of the merge function, used when merging two DataFrames,
specifies which keys are to be included in the resulting DataFrame.

A : right

B : on

C : sort

D : how

Q.no 55. Which of the following machine learning algorithms is used for market
basket analysis, that is, to analyze the association of purchased items in a single
basket or single purchase?

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 56. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 57. To save a figure to a file, we can use the ------------ method of the Figure class
in matplotlib.

A : save()

B : savefig()

C : Figure()

D : save_image()

Q.no 58. Which of the following algorithms is used in Economics, Finance, Biology,
etc. to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 59. While plotting with matplotlib.pyplot, a function call equivalent to
subplot(2,3,4) is

A : subplot(234)

B : subplot(243)

C : subplot(324)

D : subplot(4)

Q.no 60. The Apriori algorithm uses breadth-first search and a ------------ structure to
count candidate item sets efficiently.

A : Decision tree

B : Hash tree

C : Red-Black Tree

D : AVL Tree
Answer for Question No 1. is a

Answer for Question No 2. is a

Answer for Question No 3. is c

Answer for Question No 4. is a

Answer for Question No 5. is b

Answer for Question No 6. is d

Answer for Question No 7. is d

Answer for Question No 8. is a

Answer for Question No 9. is b

Answer for Question No 10. is d

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is a

Answer for Question No 15. is a

Answer for Question No 16. is d

Answer for Question No 17. is c

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is a

Answer for Question No 21. is d

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is b

Answer for Question No 25. is d

Answer for Question No 26. is c

Answer for Question No 27. is b

Answer for Question No 28. is b

Answer for Question No 29. is b

Answer for Question No 30. is a

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is b

Answer for Question No 34. is a

Answer for Question No 35. is a

Answer for Question No 36. is c

Answer for Question No 37. is a

Answer for Question No 38. is d

Answer for Question No 39. is b

Answer for Question No 40. is a

Answer for Question No 41. is a

Answer for Question No 42. is b

Answer for Question No 43. is a

Answer for Question No 44. is b

Answer for Question No 45. is a

Answer for Question No 46. is a

Answer for Question No 47. is a

Answer for Question No 48. is b

Answer for Question No 49. is a

Answer for Question No 50. is d

Answer for Question No 51. is a

Answer for Question No 52. is a

Answer for Question No 53. is d

Answer for Question No 54. is d

Answer for Question No 55. is b

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is a

Answer for Question No 59. is a

Answer for Question No 60. is b

Seat No -
Total number of questions : 60

13329_DATA ANALYTICS
Time : 1hr
Max Marks : 50
N.B

1) All questions are Multiple Choice Questions having single correct option.

2) Attempt any 50 questions out of 60.

3) Use of calculator is allowed.

4) Each question carries 1 Mark.

5) Specially abled students are allowed 20 minutes extra for examination.

6) Do not use pencils to darken answer.

7) Use only black/blue ball point pen to darken the appropriate circle.

8) No change will be allowed once the answer is marked on OMR Sheet.

9) Rough work shall not be done on OMR sheet or on question paper.

10) Darken ONLY ONE CIRCLE for each answer.

Q.no 1. ----------------- analysis estimates the relationship between a single dependent
variable and a single independent variable.

A : Simple Regression

B : Multiple regression

C : Correlation

D : Probability

Q.no 2. Find odd one out from the following :

A : KNN

B : Naïve Bayes

C : Decision Trees

D : Cluster analysis

Q.no 3. A ------------ chart is a circular plot divided into slices to show numerical
proportions.

A : Bar

B : Line

C : Scatter

D : Pie

Q.no 4. ------------ plots show all individual data points without connecting them
with lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 5. Which of the following is NOT supervised learning?

A : PCA

B : Decision Tree

C : Linear Regression

D : Naive Bayesian

Q.no 6. Probability always lies between ----- and ----

A : 0 and 1

B : -1 and +1

C : -1 and 0

D : 0 and infinity

Q.no 7. In a numpy array, array indices always start from --------

A : 1

B : -1

C : 0

D : 2

Q.no 8. To import data from an Excel file into a DataFrame, the ---------- function is
provided by the pandas package.

A : read_csv()

B : read_file()

C : read()

D : read_excel()

Q.no 9. A ---------- plot displays information as a series of data points connected by
straight lines.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 10. Which of the following is not a raster image file format?

A : PNG

B : JPG

C : BMP

D : PDF

Q.no 11. Naïve Bayes is a classification technique based on ----------

A : Bayes Theorem

B : Pythagorean Theorem

C : Least square method

D : mean square method

Q.no 12. ---- is a technique to learn from examples and experience, without being
explicitly programmed.

A : Machine Learning

B : Software Testing
C : Computer Science

D : Data mining

Q.no 13. The -------- library is built on top of NumPy, SciPy and Matplotlib.

A : Sympy

B : Scikit

C : Pandas

D : Numpy

Q.no 14. ------------- function is used to save image into an ndarray.

A : imsave()

B : imread()

C : save()

D : isave()

Q.no 15. For multidimensional visualization ---------------- are used.

A : Pie charts

B : Bar charts

C : Andrews curves

D : Scatter plots

Q.no 16. The ---------------- library from Python provides efficient versions of a large
number of machine learning algorithms.

A : Pandas

B : Numpy

C : Scikit-Learn

D : image

Q.no 17. In statistics, a population consists of -------------------

A : All People living in a country.

B : All People living in the city.

C : All subjects or objects whose characteristics are being studied.

D : Part of whole dataset

Q.no 18. Which library from python is used for implementing machine learning
algorithms?

A : Scikit-Learn

B : Pandas

C : Matplotlib

D : Numpy

Q.no 19. SQL record is an example of -----------

A : Structured data

B : Un-Structured data

C : Semi-Structured data

D : Scattered

A : Data Science

B : Data Analytics

C : Data Warehousing

D : Data mining

Q.no 21. -------- is the measure of the likelihood that an event will occur in a
random experiment.

A : Probability

B : Correlation

C : Regression

D : Sample

Q.no 22. ---------- is a measure of the randomness in the information being
processed.

A : Entropy

B : Support

C : Confidence

D : lift

Q.no 23. In the head()/tail() functions of a DataFrame, the default number of elements to
display is --------

A : 3

B : 5

C : 1

D : 10

Q.no 24. In SciPy ---------- submodule is dedicated to image processing.

A : ndimage

B : ndarray

C : signal

D : io

Q.no 25. ------ module from sklearn gathers popular unsupervised clustering
algorithms.

A : sklearn.covariance

B : sklearn.base

C : sklearn.neighbors

D : sklearn.cluster

Q.no 26. The ---------- function is used to get the element-wise remainder of division of arrays.

A : numpy.divide(x1,x2)

B : numpy.mod(x1,x2)

C : numpy.true_divide(x1,x2)

D : numpy.reminder(x1,x2)

Q.no 27. Which of the following plots is not used for multidimensional
visualization?

A : Andrews Curves

B : Parallel Chart

C : Deviation Chart

D : Bar

Q.no 28. --------------- searches for the linear optimal separating hyperplane for
separation of the data using essential training tuples called support vectors

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machines

Q.no 29. From matplotlib, the ------------------ module is used for plotting various plots.

A : Scilearn

B : Pyplot

C : Scilab

D : Matlab

Q.no 30. In ------------ the x-axes are grouped into bins and each bin will be treated
as a category.

A : Bar

B : Line

C : Scatter

D : Histogram

Q.no 31. If X and Y are independent of each other, then the correlation
coefficient is ---------

A : 1

B : -1

C : 0

D : 2
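
For reference, the sketch below computes the Pearson correlation coefficient by hand. It shows that a sample in which every X value is paired equally often with every Y value has a coefficient of zero, while exact linear relationships give +1 or -1. The function name and the sample data are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Every x value occurs with every y value: no linear association.
r_unrelated = pearson_r([0, 0, 1, 1], [0, 1, 0, 1])
# y = 2x + 1: perfect positive linear relationship.
r_increasing = pearson_r([1, 2, 3, 4], [3, 5, 7, 9])
# y = 11 - 2x: perfect negative linear relationship.
r_decreasing = pearson_r([1, 2, 3, 4], [9, 7, 5, 3])
print(r_unrelated, r_increasing, r_decreasing)
```

Note that the converse does not hold in general: a zero coefficient rules out a linear relationship, not every kind of dependence.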

Q.no 32. ----------- is an indication of how often the rule has been found to be true in
association rule mining.

A : Confidence

B : Support

C : Lift

D : None of These
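
To illustrate how often a rule holds, the sketch below computes the confidence of a rule "antecedent implies consequent" as the number of transactions containing both sides divided by the number containing the antecedent. The basket data and the helper names are hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def confidence(transactions, antecedent, consequent):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = count_containing(transactions, set(antecedent) | set(consequent))
    with_antecedent = count_containing(transactions, antecedent)
    return both / with_antecedent


# Hypothetical market-basket data (four transactions).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

# Rule {bread} -> {milk}: 2 of the 3 bread baskets also contain milk.
bread_to_milk = confidence(baskets, {"bread"}, {"milk"})
print(round(bread_to_milk, 3))
```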

Q.no 33. In which of the following types of clustering algorithm is the notion of
similarity derived from the closeness of a data point to the centroid of the
clusters?

A : Connectivity models

B : Centroid models

C : Distribution models

D : Density models

Q.no 34. The last element of ndarray is indexed by -------------

A:0

B : -1

C:1

D : -2
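
The indexing convention in this question can be tried with a plain Python list, used here as a stand-in because a one-dimensional ndarray follows the same rule. The sample values are made up.

```python
# Hypothetical data; a 1-D numpy array indexes the same way.
values = [10, 20, 30, 40]

first = values[0]         # indexing starts at 0
last = values[-1]         # -1 counts back from the end
second_last = values[-2]  # -2 is the element before the last

print(first, last, second_last)  # 10 40 30
```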

Q.no 35. ------- changes the arrangement of items in an array so that the shape of the
array changes while maintaining the same number of dimensions.

A : numpy.reshape()

B : numpy.empty()

C : numpy.flatten()

D : numpy.ravel()

Q.no 36. Identify the machine generated unstructured data.

A : Website data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 37. ------------- is an unsupervised machine learning technique.

A : KNN

B : Support Vector Machines

C : Decision trees

D : Cluster analysis

Q.no 38. Support(B) =

A : (Transactions containing (B)) / (Total Transactions)

B : (Transactions containing (B)) / 100

C : (Total Transactions) / (Transactions containing (B))

D : 100 / (Transactions containing (B))

Q.no 39. ------------ is an example of semi structured data

A : XML data

B : YouTube data

C : Text File data

D : Satellite imagery data

Q.no 40. In a decision tree, a leaf node denotes a -----------------

A : class distribution

B : test on an attribute

C : outcome of the test

D : class labels

Q.no 41. Which of the following algorithms is used in Economics, Finance, Biology,
etc. to model relationships between parameters of interest?

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 42. In regression, the dependent variable is also called the ------------

A : Regression

B : Continuous

C : Regressand

D : Independent

Q.no 43. In regression, the independent variable is also called the ------------

A : Regressor

B : Continuous

C : Regressand

D : Estimated

Q.no 44. Which of the following functions is not used to iterate over the rows of a
DataFrame?

A : iteritems()

B : iterrows()

C : itertuples()

D : iterpanel()

Q.no 45. ------------ analysis is a set of statistical processes for estimating the
relationships among dependent and independent variables.

A : Regression

B : Decision tree

C : KNN

D : None of These

Q.no 46. In unsupervised learning, scikit-learn uses the ------------------- method to infer
properties of the data.

A : extract()

B : transform()

C : infer()

D : classify()

Q.no 47. To reach the final point and make a prediction, a decision tree must
be traversed from ----------

A : Top - to - bottom

B : Bottom- to - Top

C : Left- to Right

D : Right - to - Left

Q.no 48. The ------- is characterized by a bell-shaped curve, and the area under the curve
represents probabilities.

A : Normal Distribution

B : Binomial Distribution

C : Poisson Distribution

D : Probability

Q.no 49. Which of the following functions is used to split a figure into nrows*ncols
sub-axes?

A : plot()

B : draw()

C : bar()

D : subplot()

Q.no 50. In a data science project, the data acquisition step involves ----------------

A : Acquiring data from various sources.

B : Selecting dataset
C : Data preprocessing

D : Data modeling

Q.no 51. ----------- function from scipy is used to calculate the distance between all
pairs of points in a given set.

A : scipy.spatial.distance()

B : scipy.spatial.distance.measure()

C : scipy.spatial.distance.cdist()

D : distance(x1,y1)

Q.no 52. Which function returns an ndarray object that contains numbers that
are evenly spaced on a log scale?

A : numpy.logspace()

B : numpy.log()

C : numpy.fill()

D : numpy.random()

Q.no 53. In matplotlib, -------------- is the container class for the figure instance.

A : Axes

B : Canvas

C : Figure

D : FigureCanvas

Q.no 54. The ---------- machine learning algorithm is used in cross marketing to work with
other businesses that complement your own business but not with competitors.

A : Decision tree

B : Association Rule Mining

C : Clustering

D : Support vector machine

Q.no 55. Select the correct statement:

A : Raw data is original source of data.

B : Preprocessed data is original source of data.

C : Raw data is the data obtained after processing steps.

D : Analysed data is original source of data.

Q.no 56. It is a measure of disorder or purity or unpredictability or uncertainty.

A : Entropy

B : Support

C : Confidence

D : lift

Q.no 57. Determining the basic salary of an employee when the employee's qualification is
given is a ----------- problem.

A : Correlation

B : Regression

C : Association

D : Qualitative

Q.no 58. The statement subplot(4,3,5) will divide the figure into ------- and specify that
plotting should be done on plot number -----------

A : 4 x 3, 5

B : 3 x 4, 5

C : 3 x 5, 4

D : 5 x 3, 4

A : Regression

B : Decision Trees

C : Clustering

D : Naïve Bayes

Q.no 60. The --------- function is used to display an image through an external viewer in
scipy.

A : display()

B : imread()

C : imshow()

D : show()
Answer for Question No 1. is a

Answer for Question No 2. is d

Answer for Question No 3. is d

Answer for Question No 4. is c

Answer for Question No 5. is a

Answer for Question No 6. is a

Answer for Question No 7. is c

Answer for Question No 8. is d

Answer for Question No 9. is b

Answer for Question No 10. is d

Answer for Question No 11. is a

Answer for Question No 12. is a

Answer for Question No 13. is b

Answer for Question No 14. is a

Answer for Question No 15. is c

Answer for Question No 16. is c

Answer for Question No 17. is c

Answer for Question No 18. is a

Answer for Question No 19. is a

Answer for Question No 20. is b

Answer for Question No 21. is a

Answer for Question No 22. is a

Answer for Question No 23. is b

Answer for Question No 24. is a

Answer for Question No 25. is d

Answer for Question No 26. is b

Answer for Question No 27. is d

Answer for Question No 28. is d

Answer for Question No 29. is b

Answer for Question No 30. is d

Answer for Question No 31. is c

Answer for Question No 32. is a

Answer for Question No 33. is b

Answer for Question No 34. is b

Answer for Question No 35. is a

Answer for Question No 36. is d

Answer for Question No 37. is d

Answer for Question No 38. is a

Answer for Question No 39. is a

Answer for Question No 40. is c

Answer for Question No 41. is a

Answer for Question No 42. is c

Answer for Question No 43. is a

Answer for Question No 44. is d

Answer for Question No 45. is a

Answer for Question No 46. is b

Answer for Question No 47. is a

Answer for Question No 48. is a

Answer for Question No 49. is d

Answer for Question No 50. is a

Answer for Question No 51. is c

Answer for Question No 52. is a

Answer for Question No 53. is d

Answer for Question No 54. is b

Answer for Question No 55. is a

Answer for Question No 56. is a

Answer for Question No 57. is b

Answer for Question No 58. is a

Answer for Question No 59. is b

Answer for Question No 60. is c

DA---Unit I
This sheet is for 1 Mark questions
S.r No Question Image a b c d Correct Answer
e.g 1 Write down question img.jpg Option a Option b Option c Option d a/b/c/d

1 Business intelligence (BI) is a broad category of application programs which includes _____________ a) Decision support b) Data mining c) OLAP d) All of the mentioned d
2 BI can catalyze a business’s success in terms of _____________ a) Distinguish the products and services that drive revenues b) Rank customers and locations based on profitability c) Ranks customers and locations based on probability d) All of the mentioned d
3 Which of the following areas are affected by BI? a) Revenue b) CRM c) Sales d) All of the mentioned b
4 ________ is a performance management tool that recapitulates an organization’s performance from several standpoints on a single page. a) Balanced Scorecard b) Data Cube c) Dashboard d) All of the mentioned a
5 __________ is a system where operations like data extraction, transformation and loading operations are executed. a) Data staging b) Data integration c) ETL d) None of the mentioned a
6 _________ is a category of applications and technologies for presenting and analyzing corporate and external data. a) Data warehouse b) MIS c) EIS d) All of the mentioned c
7 Which of the following is the process of basing an organization’s actions and decisions on actual measured results of performance? a) Institutional performance management b) Gap analysis c) Slice and Dice d) None of the mentioned a
8 Which of the following does not form part of BI Stack in SQL Server? a) SSRS b) SSIS c) SSAS d) OBIEE d
9 BI can catalyze a business’s success in terms of _____________ a) Distinguish the products and services that drive revenues b) Rank customers and locations based on profitability c) Ranks customers and locations based on probability d) All of the mentioned d
10 This is an approach to selling goods and services in which a prospect explicitly agrees in advance to receive marketing information. A. customer managed relationship B. data mining C. permission marketing D. one-to-one marketing c
11 In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences. a. Web services b. customer-facing c. client/server d. personalization d
12 This is the processing of data about customers and their relationship with the enterprise in order to improve the enterprise’s future sales and service and lower cost. a. clickstream analysis b. database marketing c. customer relationship management d. CRM analytics d
13 This is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. a. best practice b. data mart c. business information warehouse d. business intelligence d
14 This is a systematic approach to the gathering, consolidation, and processing of consumer data (both for customers and potential customers) that is maintained in a company’s databases. a. database marketing b. marketing encyclopedia c. application integration d. service oriented integration a
15 This is an arrangement in which a company outsources some or all of its customer relationship management functions to an application service provider (ASP). a. spend management b. supplier relationship management c. hosted CRM d. Customer Information Control System c
16 This is an XML-based metalanguage developed by the Business Process Management Initiative (BPMI) as a means of modeling business processes, much as XML is, itself, a metalanguage with the ability to model enterprise data. a. BizTalk b. BPML c. e-biz d. ebXML b
17 This is a central point in an enterprise from which all customer contacts are managed. a. contact center b. help system c. multichannel marketing d. call center a
18 This is the practice of dividing a customer base into groups of individuals that are similar in specific ways relevant to marketing, such as age, gender, interests, spending habits, and so on. a. customer service chat b. customer managed relationship c. customer life cycle d. customer segmentation d
19 In data mining, this is a technique used to predict future behavior and anticipate the consequences of change. a. predictive technology b. disaster recovery c. phase change d. predictive modeling d
20 According to analysts, for what can traditional IT systems provide a foundation when they’re integrated with big data technologies like Hadoop? a) Big data management and data mining b) Data warehousing and business intelligence c) Management of Hadoop clusters d) Collecting and storing unstructured data a
21 All of the following accurately describe Hadoop, EXCEPT: a) Open source b) Real-time c) Java-based d) Distributed computing approach b
22 __________ has the world’s largest Hadoop cluster. a) Apple b) Datamatics c) Facebook d) None of the mentioned c
23 What are the five V’s of Big Data? a) Volume b) Velocity c) Variety d) All of the above d
24 _________ hides the limitations of Java behind a powerful and concise Clojure API for Cascading. a) Scalding b) Cascalog c) Hcatalog d) Hcalding b
25 What are the main components of Big Data? a) MapReduce b) HDFS c) YARN d) All of these d
26 What are the different features of Big Data Analytics? a) Open-Source b) Scalability c) Data Recovery d) All the above d
27 Define the Port Numbers for NameNode, Task Tracker and Job Tracker. a) NameNode b) Task Tracker c) Job Tracker d) All of the above d
28 Facebook Tackles Big Data With _______ based on Hadoop a) Project Prism b) Prism c) ProjectData d) ProjectBid a
29 What is a unit of data that flows through a Flume agent? a) Record b) Event c) Row d) Log b
30 A feature F1 can take certain value: A, B, C, D, E, & F and represents grade of students from a college. Which of the following statement is true in the following case a) Feature F1 is an example of nominal variable. b) Feature F1 is an example of ordinal variable. c) It doesn’t belong to any of the above category. d) Both of these b
31 Which of the following is an example of a deterministic algorithm? a) PCA b) K-Means c) None of the above d) all of the above a
32 What is the entropy of the target variable? a) -(5/8 log(5/8) + 3/8 log(3/8)) b) 5/8 log(5/8) + 3/8 log(3/8) c) 5/8 log(5/8) + 3/8 log(3/8) d) 5/8 log(3/8) – 3/8 log(5/8) a
33 Point out the correct statement. a) OLAP is an umbrella term that refers to an assortment of software applications for analyzing an organization’s raw data for intelligent decision making b) Business intelligence equips enterprises to gain business advantage from data c) BI makes an organization agile thereby giving it a lower edge in today’s evolving market condition d) None of the mentioned b
34 BI can catalyze a business’s success in terms of _____________ a) Distinguish the products and services that drive revenues b) Rank customers and locations based on profitability c) Ranks customers and locations based on probability d) All of the mentioned d
35 Which of the following areas are affected by BI? a) Revenue b) CRM c) Sales d) All of the mentioned b
36 Which of the following does not form part of BI Stack in SQL Server? a) SSRS b) SSIS c) SSAS d) OBIEE d
37 BI can catalyze a business’s success in terms of _____________ a) Distinguish the products and services that drive revenues b) Rank customers and locations based on profitability c) Ranks customers and locations based on probability d) All of the mentioned d
38 Heuristic is a) A set of databases from different vendors, possibly using different database paradigms b) An approach to a problem that is not guaranteed to work but performs well in most cases c) Information that is hidden in a database and that cannot be recovered by a simple SQL query. d) None of these b
39 In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences. a. Web services b. customer-facing c. client/server d. personalization d
40 Heterogeneous databases referred to a) A set of databases from different vendors, possibly using different database paradigms b) An approach to a problem that is not guaranteed to work but performs well in most cases. c) Information that is hidden in a database and that cannot be recovered by a simple SQL query. d) None of these a
DA---Unit II
This sheet is for 1 Mark questions
S.r No Question Image a b c d Correct Answer

e.g 1 Write down question img.jpg Option a Option b Option c Option d a/b/c/d

1 Movie Recommendation systems are an example of: a) Classification b) Clustering c) Reinforcement Learning d) Regression b and c
2 Sentiment Analysis is an example of: a) Regression b) Classification c) Clustering d) Reinforcement Learning a,b and d
3 What is the minimum no. of variables/ features required to perform clustering? a) 0 b) 1 c) 2 d) 3 b
4 Is it possible that Assignment of observations to clusters does not change between successive iterations in K-Means a) Yes b) No c) Can’t say d) None of these a
5 Which of the following can act as possible termination conditions in K-Means? a) For a fixed number of iterations. b) Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum. c) Centroids do not change between successive iterations. d) Terminate when RSS falls below a threshold. a,b,c,d
6 Which of the following clustering algorithms suffers from the problem of convergence at local optima? a) K- Means clustering algorithm b) Agglomerative clustering algorithm c) Expectation-Maximization clustering algorithm d) Diverse clustering algorithm a and c
7 Which of the following algorithm is most sensitive to outliers? a) K-means clustering algorithm b) K-medians clustering algorithm c) K-modes clustering algorithm d) K-medoids clustering algorithm a
8 How can Clustering (Unsupervised Learning) be used to improve the accuracy of Linear Regression model (Supervised Learning): a) Creating different models for different cluster groups. b) Creating an input feature for cluster ids as an ordinal variable. c) Creating an input feature for cluster centroids as a continuous variable. d) Creating an input feature for cluster size as a continuous variable. a,b,c,d
9 What could be the possible reason(s) for producing two different dendrograms using agglomerative clustering algorithm for the same dataset? a) Proximity function used b) No. of data points used c) No. of variables used d) All of the above d
10 In which of the following cases will K-Means clustering fail to give good results? a) Data points with outliers b) Data points with different densities c) Data points with round shapes d) Data points with non-convex shapes a,b,and d
11 Which of the following is/are valid iterative strategy for treating missing values before clustering analysis? a) Imputation with mean b) Nearest Neighbor assignment c) Imputation with Expectation Maximization algorithm d) All of the above c
12 Feature scaling is an important step before applying K-Mean algorithm. What is reason behind this? a) In distance calculation it will give the same weights for all features b) You always get the same clusters. If you use or don’t use feature scaling c) In Manhattan distance it is an important step but in Euclidean it is not d) None of these a
13 Which of the following method is used for finding optimal of cluster in K-Mean algorithm? a) Elbow method b) Manhattan method c) Euclidean method d) All of the above a
14 What is true about K-Mean Clustering? a) K-means is extremely sensitive to cluster center initializations b) Bad initialization can lead to Poor convergence speed c) Bad initialization can lead to bad overall clustering d) None of these d
15 Which of the following can be applied to get good results for K-means algorithm corresponding to global minima? a) Try to run algorithm for different centroid initialization b) Adjust number of iterations c) Find out the optimal number of clusters d) None of these a,b,c
16 If you are using Multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important: a) All the data points follow two Gaussian distribution b) All the data points follow n Gaussian distribution (n >2) c) All the data points follow two multinomial distribution d) All the data points follow n multinomial distribution (n >2) c
17 Which of the following is/are not true about Centroid based K-Means clustering algorithm and Distribution based expectation-maximization clustering algorithm: a) Both starts with random initializations b) Both are iterative algorithms c) Both have strong assumptions that the data points must fulfill d) Expectation maximization algorithm is a special case of K-Means d
18 Which of the following is/are not true about DBSCAN clustering algorithm: a) For data points to be in a cluster, they must be in a distance threshold to a core point b) It has strong assumptions for the distribution of data points in dataspace c) It has substantially high time complexity of order O(n^3) d) It does not require prior knowledge of the no. of desired clusters b and c
19 Which of the following are the high and low bounds for the existence of F-Score? a) [0,1] b) (0,1) c) [-1,1] d) None of the above a
1. All of the following increase the width
20 of a confidence interval except: a. Increased confidence level b. Increased variability c. Increased sample size d. Decreased sample size c
21 The p-value in hypothesis testing represents which of the following: Please select the best answer of those provided below.
a) The probability of failing to reject the null hypothesis, given the observed results b) The probability that the null hypothesis is true, given the observed results c) The probability that the observed results are statistically significant, given that the null hypothesis is true d) The probability of observing results as extreme or more extreme than currently observed, given that the null hypothesis is true
Ans: d
22 Assume that the difference between the observed, paired sample values is defined in the same manner and that the specified significance level is the same for both hypothesis tests. Using the same data, the statement that “a paired/dependent two sample t-test is equivalent to a one sample t-test on the paired differences, resulting in the same test statistic, same p-value, and same conclusion” is: Please select the best answer of those provided below.
a) Always True b) Never True c) Sometimes True d) Not Enough Information
Ans: a
23 Green sea turtles have normally distributed weights, measured in kilograms, with a mean of 134.5 and a variance of 49.0. A particular green sea turtle’s weight has a z-score of -2.4. What is the weight of this green sea turtle? Round to the nearest whole number.
a) 17 kg b) 151 kg c) 118 kg d) 252 kg
Ans: c
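The arithmetic behind question 23 can be checked with a short sketch. It assumes only the standard relation x = mean + z * standard deviation; the variable names are illustrative and not part of the question bank.

```python
import math

mean_kg = 134.5   # mean weight given in the question
variance = 49.0   # variance given in the question
z_score = -2.4    # z-score given in the question

# The standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

# Invert z = (x - mean) / std_dev to recover the raw weight.
weight_kg = mean_kg + z_score * std_dev
print(round(weight_kg))
```

Note that the question gives the variance, not the standard deviation, so the square root has to be taken before the z-score is applied.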
24 What percentage of measurements in a dataset fall above the median?
a) 49% b) 50% c) 51% d) Cannot Be Determined
Ans: d
25 The proportion of variation in 5k race times that can be explained by the variation in the age of competitive male runners was approximately 0.663. What is the value of the sample linear correlation coefficient? Round to 3 decimal places.
a) 0.663 b) 0.814 c) -0.814 d) 0.440
Ans: c
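For question 25, the magnitude of the correlation coefficient follows from the coefficient of determination. The sketch below only recovers |r|. The sign of r matches the sign of the regression slope, which comes from results that are not reproduced in this excerpt, so the sign is left open here.

```python
import math

r_squared = 0.663  # proportion of explained variation, from the question

# In simple linear regression, r^2 is the square of the correlation coefficient.
r_magnitude = math.sqrt(r_squared)
print(round(r_magnitude, 3))
```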
26 Using all of the results provided, is it reasonable to predict the 5k race time (minutes) of a competitive male runner 73 years of age?
a) Yes; linear correlation between age and 5k race times is statistically significant b) Yes; both the sample linear regression equation and an age in years is provided c) No; linear correlation between age and 5k race times is not statistically significant d) No; the age provided is beyond the scope of our available sample data
Ans: d
27 Algorithm is
a) It uses machine-learning techniques. Here program can learn from past experience and adapt themselves to new situations b) Computational procedure that takes some value as input and produces some value as output c) Science of making machines performs tasks that would require intelligence when performed by humans d) None of these
Ans: b
28 Bias is
a) A class of learning algorithm that tries to find an optimum classification of a set of examples using the probabilistic theory b) Any mechanism employed by a learning system to constrain the search space of a hypothesis c) An approach to the design of learning algorithms that is inspired by the fact that when people encounter new situations, they often explain them by reference to familiar experiences, adapting the explanations to fit the new situation. d) None of these
Ans: b
29 Classification is
a) A subdivision of a set of examples into a number of classes b) A measure of the accuracy, of the classification of a concept that is given by a certain theory c) The task of assigning a classification to a set of examples d) None of these
Ans: a
This takes only two values. In
Systems that can be used
general, these values will be 0 The natural environment of a
Binary attribute are without knowledge of None of these
and 1 and .they can be coded certain species
internal operations
30 as one bit a
31 Classification accuracy is
a) A subdivision of a set of examples into a number of classes b) Measure of the accuracy, of the classification of a concept that is given by a certain theory c) The task of assigning a classification to a set of examples d) None of these
Ans: b
Operations on a database to Symbolic representation of
Group of similar objects that
transform or simplify data in facts or ideas from which
Cluster is differ significantly from other None of these
order to prepare it for a information can potentially
objects
32 machine-learning algorithm be extracted a
33 A definition of a concept is ----- if it recognizes all the instances of that concept
a) Complete b) Consistent c) Constant d) None of these
Ans: a
34 A definition of a concept is ----- if it classifies any examples as coming within the concept
a) Complete b) Consistent c) Constant d) None of these
Ans: b
A subject-oriented
The actual discovery phase of integrated time variant
The stage of selecting the
Data selection is a knowledge discovery non-volatile collection of None of these
right data for a KDD process
process data in support of
35 management b
A measure of the accuracy,
A subdivision of a set of The task of assigning a
of the classification of a
Classification task referred to examples into a number of classification to a set of None of these
concept that is given by a
classes examples
36 certain theory c
Decision support systems
Approach to the design of that contain an information
Combining different types of learning algorithms that is base filled with the
Hybrid is None of these
method or information structured along the lines of knowledge of an expert
the theory of evolution. formulated in terms of
37 if-then rules. a
An extremely complex
It is hidden within a database
The process of executing molecule that occurs in
and can only be recovered if
implicit previously unknown human chromosomes and
Discovery is one is given certain clues (an None of these
and potentially useful that carries genetic
example IS encrypted
information from data information in the form of
information).
38 genes. b
39 What could be the possible reason(s) for producing two different dendrograms using agglomerative clustering algorithm for the same dataset?
a) Proximity function used b) No. of data points used c) No. of variables used d) All of the above
Ans: d
40 Is it possible that Assignment of observations to clusters does not change between successive iterations in K-Means
a) Yes b) No c) Can’t say d) None of these
Ans: a
DA---Unit III

This sheet is for 1 Mark questions

S.r No Question Image a b c d Correct Answer
e.g 1 Write down question img.jpg Option a Option b Option c Option d a/b/c/d

1 This clustering algorithm terminates when mean values computed for the current iteration of the algorithm are identical to the computed mean values for the previous iteration
a) K-Means clustering b) conceptual clustering c) expectation maximization d) agglomerative clustering
Ans: a
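The termination rule described in question 1 can be illustrated with a minimal one-dimensional K-Means sketch. The data points, the initial centers, and the function name `kmeans_1d` are made up for illustration; this is not a reference implementation.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Tiny 1-D K-Means; inputs are hypothetical illustration data."""
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: recompute each mean (keep the old center if a cluster is empty).
        new_centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
        # Stop when the means are identical to those of the previous iteration.
        if new_centers == centers:
            return centers, iteration
        centers = new_centers
    return centers, max_iter


final_centers, iterations = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [1.0, 2.0])
print(final_centers, iterations)
```

Once the means stop changing, the cluster assignments stop changing as well, which is the situation question 40 of the previous unit asks about.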
2 The correlation coefficient for two real-valued attributes is –0.85. What does this value tell you?
a) The attributes are not linearly related. b) As the value of one attribute decreases the value of the second attribute increases. c) As the value of one attribute increases the value of the second attribute also increases. d) The attributes show a linear relationship
Ans: b
3 Given a rule of the form IF X THEN Y, rule confidence is defined as the conditional probability that
a) Y is false when X is known to be false. b) Y is true when X is known to be true. c) X is true when Y is known to be true d) X is false when Y is known to be false.
Ans: b
4 Chameleon is
a) Density based clustering algorithm b) Partitioning based algorithm c) Model based algorithm d) Hierarchical clustering algorithm
Ans: d
5 Find odd man out DBSCAN K-Mean PAM None of above a
6 The number of iterations in apriori ___________
a) increases with the size of the data b) decreases with the increase in size of the data c) increases with the size of the maximum frequent set d) decreases with increase in size of the maximum frequent set
Ans: c
Which of the following are
interestingness measures for b
7 association rules? Recall Lift Accuracy All of Above
8 Given a frequent itemset L, If |L| = k, then there are
a) 2^k – 1 candidate association rules b) 2^k candidate association rules c) 2^k – 2 candidate association rules d) 2^(k–2) candidate association rules
Ans: c
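The count in question 8 can be verified by enumeration. The sketch below assumes a candidate rule splits the itemset into a non-empty antecedent and a non-empty consequent; the helper name `candidate_rules` and the example itemset are illustrative only.

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every rule antecedent -> consequent with both sides non-empty."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):  # sizes 1 .. k-1 exclude the empty and full sets
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


rules = candidate_rules({"A", "B", "C", "D"})
print(len(rules), 2 ** 4 - 2)
```

Excluding the empty set and the full itemset from the 2^k possible antecedents gives the 2^k – 2 of option c.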
_________ is an example for case Neural Genetic K-nearest
d
9 based-learning Decision trees networks algorithm neighbor
The average positive difference mean
between computed and desired mean positive mean squared absolute root mean c
10 outcome values. error error error squared error
11 Frequent item sets is
a) Superset of only closed frequent item sets b) Superset of only maximal frequent item sets c) Subset of maximal frequent item sets d) Superset of both closed frequent item sets and maximal frequent item sets
Ans: d
12 Assume that we have a dataset containing information about 200 individuals. A supervised data mining session has discovered the following rule: IF age < 30 & credit card insurance = yes THEN life insurance = yes. Rule Accuracy: 70% and Rule Coverage: 63%. How many individuals in the class life insurance = no have credit card insurance and are less than 30 years old?
a) 63 b) 38 c) 40 d) 89
Ans: b
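A sketch of the arithmetic behind question 12. It assumes that rule coverage is the fraction of all individuals who satisfy the rule's condition, and that rule accuracy is the fraction of those covered individuals for whom the conclusion holds. These definitions are an assumption, since the question bank does not spell them out.

```python
total_individuals = 200
rule_coverage = 0.63  # fraction of individuals matching "age < 30 & credit card insurance = yes"
rule_accuracy = 0.70  # fraction of matching individuals with life insurance = yes

# Individuals who satisfy the rule's condition.
covered = total_individuals * rule_coverage

# Covered individuals for whom the conclusion fails (life insurance = no).
covered_but_no = covered * (1 - rule_accuracy)
print(round(covered), round(covered_but_no))
```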
Which of the following is cluster Simple Grouping Labeled Query results
b
13 analysis? segmentation similar objects classification grouping
low intra
class
high inter similarity c
A good clustering method will class high intra class
14 produce high quality clusters with similarity similarity None of above
Min sup and
Which two parameters are needed for Min points and min Number of b
15 DBSCAN Min threshold eps confidence centroids
Both
techniques
build models
whose output
is determined Both models
by a linear require d
sum of numeric
weighted The output of attributes to Both models
Which statement is true about neural input both models is a range require input
network and linear regression attribute categorical between 0 attributes to be
16 models? values. attribute value. and 1. numeric.
17 In Apriori algorithm, if 1 item-sets are 100, then the number of candidate 2 item-sets are
a) 100 b) 200 c) 4950 d) 5000
Ans: c
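The figure in question 17 is the number of unordered pairs that can be formed from the frequent 1-itemsets. A minimal check, assuming every pair is kept as a candidate before any support counting:

```python
from math import comb

frequent_1_itemsets = 100

# Each candidate 2-itemset is an unordered pair of frequent single items.
candidate_2_itemsets = comb(frequent_1_itemsets, 2)
print(candidate_2_itemsets)
```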
Finding
Significant Bottleneck in the Apriori frequent Candidate Number of c
18 algorithm is itemsets Pruning generation iterations
are better
able to deal typically assume have trouble
Machine learning techniques differ with missing an underlying with are not able to a
from statistical techniques in that and noisy distribution for large-sized explain their
19 machine learning methods data the data datasets behavior.
The probability of a hypothesis before
a
20 the presentation of evidence. a priori posterior conditional subjective
21 KDD represents extraction of data knowledge rules model b
Outliers
should be
part of the The nature Outliers should
training of the be part of the
c
dataset but Outliers should problem test dataset but
should not be be identified determines should not be
Which statement about outliers is present in the and removed how outliers present in the
22 true? test data. from a dataset. are used training data.
23 The most general form of distance is Manhattan Eucledian Mean Minkowski d
High support High support Low support Low support
Which Association Rule would you and medium and low and high and low c
24 prefer confidence confidence confidence confidence
In a Rule based classifier, If there is a
rule for each combination of attribute
a
values, what do you called that rule Comprehens Mutually
25 set R Exhaustive Inclusive ive exclusive
To improve
To decrease the the
If a set cannot efficiency, do efficiency, If a set can pass
pass a test, its level-wise do level-wise a test, its a
supersets will generation of generation supersets will
also fail the frequent item of frequent fail the same
26 The apriori property means same test sets item sets test
If an item set ‘XYZ’ is a frequent item
set, then all subsets of that frequent c
27 item set are Undefined Not frequent Frequent Can not say
28 The probability that a person owns a sports car given that they subscribe to automotive magazine is 40%. We also know that 3% of the adult population subscribes to automotive magazine. The probability of a person owning a sports car given that they don’t subscribe to automotive magazine is 30%. Use this information to compute the probability that a person subscribes to automotive magazine given that they own a sports car
a) 0.0368 b) 0.0396 c) 0.0389 d) 0.0398
Ans: b
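Question 28 is a direct application of Bayes' rule. The sketch below works through it with the numbers from the question; the variable names are illustrative.

```python
p_car_given_mag = 0.40     # P(sports car | subscribes)
p_mag = 0.03               # P(subscribes)
p_car_given_no_mag = 0.30  # P(sports car | does not subscribe)

# Law of total probability: overall chance of owning a sports car.
p_car = p_car_given_mag * p_mag + p_car_given_no_mag * (1 - p_mag)

# Bayes' rule: P(subscribes | sports car).
p_mag_given_car = p_car_given_mag * p_mag / p_car
print(round(p_mag_given_car, 4))
```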
Simple regression assumes a __________
relationship between the input c
29 attribute and output attribute. quadratic inverse linear reciprocal
Both
Only minimum
minimum Neither support support and Minimum c
To determine association rules from confidence not confidence confidence support is
30 frequent item sets needed needed are needed needed
If {A,B,C,D} is a frequent itemset,
candidate rules which is not possible b
31 is C –> A D –>ABCD A –> BC B –> ADC
High support Low support Low support High support
Which Association Rule would you and low and high and low and medium b
32 prefer confidence confidence confidence confidence
Classification rules are extracted from
a
33 _____________ decision tree root node branches siblings
What does K refers in the K-Means
algorithm which is a non-hierarchical No of . number of d
34 clustering approach? Complexity Fixed value iterations clusters
If Linear regression model perfectly Test error is Couldn’t Test error is
first i.e., train error is zero, then also always Test error is non comment on equal to Train c
35 _____________________ zero zero Test error error
36 Which of the following metrics can be used for evaluating regression models? i) R Squared ii) Adjusted R Squared iii) F Statistics iv) RMSE/MSE/MAE
a) ii and iv b) i and ii c) ii, iii and iv d) i, ii, iii and iv
Ans: d
How many coefficients do you need to
estimate in a simple linear regression 1 2 3 4 b
37 model (One independent variable)?
In a simple linear regression model
(One independent variable), If we
change the input variable by 1 unit. d
How much output variable will
38 change? by 1 no change by intercept by its slope
In syntax of linear model
lm(formula,data,..), data refers to array vector list c
39 ______ Matrix
40 In the mathematical Equation of Linear Regression Y = β1 + β2X + ϵ, (β1, β2) refers to __________
a) (X-intercept, Slope) b) (Slope, X-Intercept) c) (Y-Intercept, Slope) d) (slope, Y-Intercept)
Ans: c
DA---Unit IV

This sheet is for 1 Mark questions

S.r No Question Image a b c d Correct Answer
e.g 1 Write down question img.jpg Option a Option b Option c Option d a/b/c/d
A _________ is a decision support tool that uses a
tree-like graph or model of decisions and their
possible consequences, including chance event Neural
1 outcomes, resource costs, and utility. Decision tree Graphs Trees Networks a
Structure in Flow-Chart &
which internal Structure in
node represents which internal
test on an node None of
3 What is Decision Tree? Flow-Chart attribute, each represents Above c
Decision Trees can be used for Classification
4 Tasks. TRUE FALSE a
Choose from the following that are Decision Tree Decision Chance
5 nodes? Nodes End Nodes Nodes All of Above d
6 Decision Nodes are represented by ____________ Disks Squares Circles Triangles b
7 Chance Nodes are represented by __________ Disks Squares Circles Triangles c
8 End Nodes are represented by __________ Disks Squares Circles Triangles d
Worst, best
Use a white box and expected
model, If given values can be
Possible result is determined for
Which of the following are the advantage/s of Scenarios can provided by a different
9 Decision Trees? be added model scenarios All of Above d
Attributes are Attributes are
statistically statistically
dependent of independent Attributes
Attributes are one another of one another can be
Which of the following statements about Naive equally given the class given the nominal or
10 Bayes is incorrect? important. value. class value. numeric b
Linear Naive
11 Which of the following is not supervised learning? Clustering Decision Tree Regression Bayesian a
12 How many terms are required for building a bayes model?
a) 1 b) 2 c) 3 d) 4
Ans: c
Answering
Solving Increasing Decreasing probabilistic
13 Where does the bayes rule can be used? queries complexity complexity query d
How the bayesian network can be used to answer Full Partial
14 any query? distribution Joint distribution distribution All of Above b
Both
Conditionally
What is the consequence between a node and its Functionally Conditionally dependant &
15 predecessors while creating bayesian network? dependent Dependant independent Dependant c
An approach
to the design
of learning
algorithms
that is inspired
by the fact
that when
people
encounter
A class of new
learning situations,
algorithm that they often
tries to find an explain them
optimum by reference
classification Any mechanism to familiar
of a set of employed by a experiences,
examples learning system adapting the
using the to constrain the explanations
probabilistic search space of to fit the new None of
16 Bayesian classifiers is theory. a hypothesis situation. these a
An approach
to the design
of learning
algorithms
that is inspired
by the fact
that when
people
encounter
A class of new
learning situations,
algorithm that they often
tries to find an explain them
optimum by reference
classification Any mechanism to familiar
of a set of employed by a experiences,
examples learning system adapting the
using the to constrain the explanations
probabilistic search space of to fit the new None of
17 Bias is theory a hypothesis situation. these b
18 Background knowledge referred to
a) Additional acquaintance used by a learning algorithm to facilitate the learning process b) A neural network that makes use of a hidden layer c) It is a form of automatic learning. d) None of these
Ans: a
A measure of
the accuracy, of
A subdivision the The task of
of a set of classification of assigning a
examples into a concept that classification
a number of is given by a to a set of None of
19 Classification accuracy is classes certain theory examples these b
A measure of
the accuracy, of
A subdivision the The task of
Classification is of a set of classification of assigning a
examples into a concept that classification
a number of is given by a to a set of None of
20 classes certain theory examples these a
An extremely
complex
It is hidden molecule that
within a The process of occurs in
database and executing human
can only be implicit chromosomes
recovered if previously and that
one is given unknown and carries
certain clues potentially genetic
(an example useful information in
IS encrypted information from the form of None of
21 Discovery is information). data genes. these b
A measure of
the accuracy, of
A subdivision the The task of
Classification task referred to of a set of classification of assigning a
examples into a concept that classification
a number of is given by a to a set of None of
22 classes certain theory examples these c
The process of
finding a
solution for a
problem simply
by enumerating
all possible The distance
A stage of the solutions between two
KDD process according to points as
in which new some calculated
data is added pre-defined using the
to the existing order and then Pythagoras None of
23 Euclidean distance measure is selection. testing them theorem these c
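Option c of question 23 describes the distance given by the Pythagorean theorem. A minimal sketch with made-up points, comparing the explicit formula with the standard library helper `math.dist` (available from Python 3.8):

```python
import math

point_a = (0.0, 0.0)  # hypothetical example points
point_b = (3.0, 4.0)

# Pythagoras: square root of the sum of squared coordinate differences.
by_formula = math.sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))

# The standard library computes the same quantity.
by_library = math.dist(point_a, point_b)
print(by_formula, by_library)
```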
24 The problem of finding hidden structure in unlabeled data is called
a) Supervised learning b) Unsupervised learning c) Reinforcement learning d) None of these
Ans: b
25 Assume you want to perform supervised learning and to predict number of newborns according to size of storks’ population (http://www.brixtonhealth.com/storksBabies.pdf), it is an example of
a) Classification b) Regression c) Clustering d) Structural equation modeling
Ans: b
Discriminating between spam and ham e-mails is a
26 classification task, true or false? TRUE FALSE a
27 Which of the following is not involved in data mining?
a) Knowledge extraction b) Data archaeology c) Data exploration d) Data transformation
Ans: d
A prediction
A class of made using
learning A table with n an extremely
algorithms that independent simple
try to derive a attributes can method, such
Prolog be seen as an as always
program from n- dimensional predicting the None of
28 Naive prediction is examples space. same output. these c
In the context of
KDD and data
mining, this One of the
refers to defining
random errors aspects of a
A component in a database data None of
29 Node is of a network table. warehouse these a
One of several
possible enters Discipline in
within a statistics that
database table studies ways
that is chosen to find the
The result of by the designer most
the application as the primary interesting
of a theory or means of projections of
a rule in a accessing the multi-dimensio None of
30 Prediction is specific case data in the table. nal spaces. these a
31 What is the relation between the distance between clusters and the corresponding class discriminability?
a) proportional b) inversely-proportional c) no-relation d) None of these
Ans: a
the classification method in which the upper limit of
interval is same as of lower class interval is exclusive inclusive mid point None of
32 called…. method method method these a
larger value is 60 and the smallest value is 40 and
33 the number of classes is 5 then the class interval is 20 25 4 15 c
summary and presentation of data in tabular form nominal frequency ordinal None of
34 with several non overlapping classes is referred as distribution distribution distribution these b
the classification method in which the upper and
lower limit of interval is also in class interval itself is exclusive inclusive mid point None of
35 called…. method method method these b
36 Suppose there are 25 base classifiers. Each classifier has error rates of e = 0.35. Suppose you are using averaging as the ensemble technique. What is the probability that the ensemble of the above 25 classifiers will make a wrong prediction? Note: all classifiers are independent of each other
a) 0.05 b) 0.06 c) 0.07 d) 0.08
Ans: b
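One way to arrive at the figure in question 36 is sketched below. It assumes the ensemble combines the 25 independent classifiers by majority vote, so the ensemble is wrong when at least 13 of them are wrong. Reading "averaging" as majority voting is an assumption, not something the question states.

```python
from math import comb

n_classifiers = 25
base_error = 0.35

# The ensemble errs when a majority (13 or more) of the classifiers err.
majority = n_classifiers // 2 + 1

# Binomial tail: probability that `majority` or more independent classifiers are wrong.
ensemble_error = sum(
    comb(n_classifiers, k) * base_error**k * (1 - base_error) ** (n_classifiers - k)
    for k in range(majority, n_classifiers + 1)
)
print(round(ensemble_error, 2))
```

The result is well below the 0.35 error of a single classifier. That improvement depends on the independence assumption in the question.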
Area under
The most widely used metrics and tools to assess
Confusion Cost-sensitive the ROC
a classification model are:
37 matrix accuracy curve All of Above d
38 When performing regression or classification, which of the following is the correct way to preprocess the data?
a) Normalize the data → PCA → training b) PCA → normalize PCA output → training c) Normalize the data → PCA → normalize PCA output → training d) None of these
Ans: a
Assumes that
all the features Assumes that
in a dataset all the features
are equally in a dataset are None of
39 Which of the following is true about Naive Bayes ? important independent both a and b these c
In which of the following cases will K-means
clustering fail to give good results? 1) Data points
with outliers 2) Data points with different densities
40 3) Data points with nonconvex shapes 1 and 2 2 and 3 1, 2, and 3 1 and 3 c
S.r No Question Image a b c d Correct Answer
1 Data visualization is related with…
a) Pictorial representations b) numerical representation c) numerical calculations d) None of these
Ans: a
2 Which of the following are uses of data visualization?
a) See context of data b) Clear data understanding c) finding pattern in data d) all of above
Ans: d
3 Which of the following statements are true about using visualizations to display a dataset?
I. Visualizations are visually appealing, but don’t help the viewer understand relationships that exist in the data
II. Visualizations like graphs, charts, or visualizations with pictures are useful for conveying information, while tables just filled with text are not useful.
III. Patterns that exist in the data can be found more easily by using a visualization
a) I AND II b) II AND III c) I AND III d) ONLY III
Ans: d
The plot method on Series and gplt.plot() plt.plot() plt.plotgraph() none of the b
DataFrame is just a simple wrapper mentioned
4 around ____________
Point out the correct combination ‘hist’ for ‘box’ for ‘area’ for area all of the d
with regards to kind keyword for histogram boxplot plots mentioned
5 graph plotting.
Which of the following value is bar bar bar none of the a
provided by kind keyword for mentioned
6 barplot?
You can create a scatter plot matrix sca_matrix scatter_matrix DataFrame.plot all of the b
using the __________ method in mentioned
7 pandas.tools.plotting.
Plots may also be adorned with True FALSE Cannot Tell All Above a
8 error bars or tables.
Which of the following plots are Autocausation Autorank Autocorrelation none of the c
often used for checking randomness mentioned
9 in time series?
__________ plots are used to Lag RadViz Bootstrap All Above c
visually assess the uncertainty of a
10 statistic
11 Which of the following is not a challenge in Big Data Visualization?
a) Velocity b) Volume c) Version d) Variety
Ans: c
12 Which of the following is not a problem in Big Data Visualization?
a) Visual Noise b) Scaled Data c) Large image perception d) Information Loss
Ans: b
13 Which of the following is a problem in Big Data Visualization?
a) Structured Data b) Scaled Data c) Visual Noise d) Multiple valued Data
Ans: c
Which of the candidate is suitable Type of Visual Cardinality Size of data all of above d
14 for interactive visualtization?
Which of the following follows Zoom+Pan Focus+Context Overview+Details all of above d
15 interactive visualization approach?
16 Visual Mapping is important for_______
a) Remapping b) Overview+Details c) Focus d) Context
Ans: a
17 Data visualtization techniques are: Scatter Plot Line Chart Pie Chart all of above d
Information Visualtization Flow Chart Time Line DFD All of above d
18 techniques are
19 Data visualtization techniques are: Flow Chart Time Line Pie Chart None of these c
Information Visualtization Flow Chart Line Chart Pie Chart None of these a
20 techniques are
21 Data visualtization techniques are: Scatter Plot Time Line DFD None of these a
Information Visualtization Scatter Plot Time Line Bubble Chart None of these b
22 techniques are
Data visualtization techniques are: Histogram Parallel Time Line None of these a
23 Coordinates
Information Visualtization Semantic Histogram Area Chart None of these a
24 techniques are Network
Which of the following is realted Exponential U-Shape Null All of above d
25 term with correlation?
26 Data visualtization techniques are: Scatter Plot Time Line DFD None of these a
Coulmn graph is another name for Bar Chart Scatterplot Histogram Area Chart a
27 _____
Which of the following follows Zoom+Pan Focus+Context Overview+Details all of above d
28 interactive visualization approach?
information Visualtization techniques Pie Chart Scatterplot Histogram Area Chart a
29 are
Which of the following is category of Linear Timeline Modular Variant Timeline ER Timeline a
30 timeline? Timeline
Which of the following specifies Scatter Plot Line Chart Area Chart d
31 relationship amongst variables? All of above
Which of the following specifies d
32 category Proportions? Pie Chart Histogram Bar chart All of above
Which of the following is category of Variant Timeline ER Timeline Comarative Modular c
33 timeline? Timeline Timeline
Information Visualtization Flow Chart Time Line DFD All of above d
34 techniques are
35 Data visualtization techniques are: Flow Chart Time Line Pie Chart None of these c
Data visualtization is realted with… Pictorial numerical numerical None of these a
36 representaions representation calculations
Which of the following follows Zoom+Pan Focus+Context Overview+Details all of above d
37 interactive visualization approach?
Which of the following are Use of See context of Clear data finding pattern in all of above d
38 data visualtization data understanding data
Which of the following specifies Area Chart c
39 relationship amongst variables? Pie Chart Histogram None of these
Which of the following specifies Scatter Plot Line Chart a
40 category Proportions? Pie Chart None of these
This sheet is for 1 Mark questions
S.r No Question Image a b c d Correct Answer
e.g 1 Write down question img.jpg Option a Option b Option c Option d a/b/c/d
1 Precise and steady format data is____
a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
Ans: a
2 Inconsistent Data is______
a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
Ans: b
3 Format that self defines itself is________
a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
Ans: c
4 A little bit inconsistent data is_______
a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
Ans: d
5 XML is an example of_______
a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
6 RDBMS Follows__________
a) Structured Data b) Un Structured Data c) semi Structured Data d) Quasi Structured Data
Ans: a
7 Watson is developed by____ IBM Microsoft AT&T Google a
8 Hadoop is _____ based Framework. C++ Python JAVA C# c
Which of the following are components MAPREDUCE YARN HDFS All of Above d
9 of Hadoop?
Which of the following are components JDBC Thrift Server CLI All of Above d
10 of HIVE?
Mahout provides__________ JAVA C# Executables Mountable All of Above a
Executable Image Format
11 Libraries
Which of the following are components FLATTEN Thrift Server Muster None of these b
12 of HIVE?
Which of the following are components FLATTEN Thrift Server Muster All of above b
13 of HIVE?
Which of the following is components of Fork YARN CLI Metadata b
14 Hadoop?
RDBMS Folllows__________ Structured Un Structured semi Structured Quasi a
15 Data Data Data Structured Data
Which of the following is a clustering Fuzzy K means Canopy K-Means All of above d
16 techique?
Which of the following is HBASE Data Row Table Column All of Above d
17 Model Terminology?
Which of the following is not a Logistic Random Forest Recommender Naïve Bayes c
18 classification techique? Regression Algo
Which of the following is a classification Logistic Random Forest Naïve Bayes All of Above d
19 techique? Regression
Which of the following is HBASE Data Column Family Cell Timestamp All of Above d
20 Model Terminology?
Which of the following is a clustering Logistic Random Forest K-Means Naïve Bayes c
21 techique? Regression
Which of the following is HBASE Data Identifier Variant Timestamp None of the c
22 Model Terminology? above
Which of the following is not a Logistic Random Forest K-Means Naïve Bayes c
23 classification techique? Regression
Which of the following are components FLATTEN Thrift Server Muster None of these b
24 of HIVE?
Which of the following is HBASE Data Identifier Variant Column Qualifier None of the c
25 Model Terminology? above
Mahout provides__________ JAVA C# Executables Mountable None of the a
Executable Image Format above
26 Libraries
Which of the following is not a Logistic Canopy K-Means Fuzzy K means a
27 clustering techique? Regression
Which of the following is a clustering Fuzzy K means Canopy K-Means All of above d
28 techique?
29 Point out the correct statement.
a) Hadoop do need specialized hardware to process the data b) Hadoop 2.0 allows live stream processing of real-time data c) In Hadoop programming framework output files are divided into lines or records d) None of the above
Ans: b
30
31 What was Hadoop named after?
a) Creator Doug Cutting’s favorite circus act b) Cutting’s high school rock band c) The toy elephant of Cutting’s son d) A sound Cutting’s laptop made during Hadoop development
Ans: c
___________programming model used MapReduce Mahout Oozie None of the a
to develop Hadoop-based applications above
that can process massive amounts of
32 data.
Which of the following is not a Logistic Random Forest K-Means Naïve Bayes c
33 classification techique? Regression
Which of the following are components FLATTEN Thrift Server Muster All of above b
34 of HIVE?
Which of the following is components of Fork YARN CLI None of above b
35 Hadoop?
Hadoop is a framework that works with MapReduce, MapReduce, MapReduce, All of above a
a variety of related tools. Common Hive and MySQL and Hummer and
36 cohorts include ____________ HBase Google Apps Iguana
NoSQL databases is used mainly for Structured Un Structured semi Structured Quasi b
handling large volumes of Data Data Data Structured Data
37 ______________ data.
38 Which of the following is not a phase of Data Analytics Life Cycle?
a) Communication b) Recall c) Data Preparation d) Model Planning
Ans: b
Which of the following is a NoSQL SQL Document JSON All of above b
39 Database Type? databases
Which of the following is not a NoSQL SQL Server MongoDB Cassandra None of the a
40 database above
UNIT SUB : 410243 DA
ONE
Sr. No. Questions a b c d Ans

1 Business intelligence (BI) is a broad category of application programs which includes _____________
a) Decision support b) Data mining c) OLAP d) All of the mentioned
Ans: d

2 BI can catalyze a business’s success

4 ________ is a performance management

tool that recapitulates an organization’s
a) Balanced
Scorecard
b) Data Cube c) Dashboard d) All of the
mentioned
a
performance from several standpoints
on a single page.

5 __________ is a system where operations

like data extraction, transformation and
a) Data staging b) Data
integration
c) ETL d) None of the
mentioned
a
loading operations are executed.

6 _________ is a category of applications and

technologies for presenting and analyzing
a) Data
warehouse
b) MIS c) EIS d) All of the
mentioned
c
corporate and external data.

7 Which of the following is the process of a) Institutional

10 This is an approach to selling goods and services in which a prospect explicitly agrees in advance to receive marketing information.
A. customer managed relationship B. data mining C. permission marketing D. one-to-one marketing
Ans: c

11 In an Internet context, this is the practice of tailoring Web pages to individual users’ characteristics or preferences.
a. Web services b. customer-facing c. client/server d. personalization
Ans: d

12 This is the processing of data about customers a. clickstream

13 This is a broad category of applications and

15 This is an arrangement in which a company

16 This is an XML-based metalanguage

17 This is a central point in an enterprise from which all customer contacts are managed.
a. contact center b. help system c. multichannel marketing d. call center
Ans: a

18 This is the practice of dividing a customer base a. customer

20 1. According to analysts, for what can

29 What is a unit of data that flows through a

32 What is the entropy of the target variable?
a) -(5/8 log(5/8) + 3/8 log(3/8))
b) 5/8 log(5/8) + 3/8 log(3/8)
c) 5/8 log(5/8) + 3/8 log(3/8)
d) 5/8 log(3/8) – 3/8 log(5/8)
Ans: a
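
As a check on the entropy expression in question 32, the following minimal sketch evaluates it numerically. It assumes a binary target with 5 observations of one class and 3 of the other (read off the 5/8 and 3/8 terms) and a base-2 logarithm, which the question does not state; the function name `entropy` is illustrative only.

```python
import math

def entropy(counts, base=2):
    """Shannon entropy of a discrete distribution given as class counts."""
    total = sum(counts)
    # Classes with zero count contribute nothing, so they are skipped.
    return -sum((c / total) * math.log(c / total, base) for c in counts if c > 0)

# Assumed split of the target variable: 5 vs 3 out of 8 observations.
h = entropy([5, 3])
print(round(h, 4))
```

The leading minus sign is what makes the result non-negative, since each log of a proportion below 1 is negative.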

33 Point out the correct statement. a) OLAP is an

35 Which of the following areas are affected by BI?
a) Revenue b) CRM c) Sales d) All of the mentioned
Ans: b

36 Which of the following does not form part of BI Stack in SQL Server?
a) SSRS b) SSIS c) SSAS d) OBIEE
Ans: d

37 BI can catalyze a business’s success

2 Sentiment Analysis is an example of:
a) Regression b) Classification c) Clustering d) Reinforcement Learning
Ans: a,b,d

3 What is the minimum no. of variables/features required to perform clustering?
a) 0 b) 1 c) 2 d) 3
Ans: b

4 Is it possible that Assignment of observations to

7 Which of the following algorithms is most sensitive to outliers?
a) K-means clustering algorithm b) K-medians clustering algorithm c) K-modes clustering algorithm d) K-medoids clustering algorithm
Ans: a
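
The answer to question 7 can be illustrated with a small sketch: K-means updates each cluster center as the mean of its points, while K-medians uses the median. The one-dimensional data below is hypothetical and chosen only to show how a single outlier moves each kind of center.

```python
from statistics import mean, median

# Hypothetical 1-D cluster, then the same cluster with one outlier added.
cluster = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = cluster + [100.0]

# How far the cluster center moves under each update rule.
mean_shift = abs(mean(with_outlier) - mean(cluster))
median_shift = abs(median(with_outlier) - median(cluster))

print(f"mean-based center moved by   {mean_shift:.2f}")
print(f"median-based center moved by {median_shift:.2f}")
```

The mean-based center is pulled strongly toward the outlier, whereas the median-based center barely moves.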

8 How can Clustering (Unsupervised Learning) be

10 In which of the following cases will K-Means

14 What is true about K-Mean Clustering? K-means is

16 If you are using Multinomial mixture models with

19 Which of the following are the high and low bounds for the existence of F-Score?
a) [0,1] b) (0,1) c) [-1,1] d) None of the above
Ans: a
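
To see why the bounds in question 19 are closed, here is a minimal sketch of the F-score as the weighted harmonic mean of precision and recall. The helper name `f_score` and the convention of returning 0 when the denominator is zero are assumptions of this sketch, not something stated in the question.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the usual F1 score."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        # Assumed convention: no true positives at all gives a score of 0.
        return 0.0
    return (1 + b2) * precision * recall / denom

print(f_score(1.0, 1.0))   # perfect precision and recall
print(f_score(0.0, 0.5))   # zero precision
print(f_score(0.5, 0.5))   # in between
```

Both endpoints are reachable: perfect precision and recall give 1, and zero precision or recall gives 0.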

20 1. All of the following increase the width of a confidence interval except:
a. Increased confidence level b. Increased variability c. Increased sample size d. Decreased sample size
Ans: c
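
The effect of each option in question 20 can be checked with a short sketch. It assumes the simplest case, a z-interval for a mean with known standard deviation, where the width is 2·z·σ/√n; the function name `ci_width` and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def ci_width(std_dev, n, confidence=0.95):
    """Width of a two-sided z-interval for a mean (known std_dev assumed)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * std_dev / sqrt(n)

base = ci_width(std_dev=10, n=100)
print("baseline:            ", round(base, 2))
print("higher confidence:   ", round(ci_width(10, 100, confidence=0.99), 2))
print("higher variability:  ", round(ci_width(20, 100), 2))
print("larger sample size:  ", round(ci_width(10, 400), 2))
print("smaller sample size: ", round(ci_width(10, 25), 2))
```

Only the larger sample size produces a narrower interval than the baseline.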

21 3. The p-value in hypothesis testing represents a. The

23 19. Green sea turtles have normally

25 24. The proportion of variation in 5k race

31 Classification accuracy is A subdivision

35 Data selection is The actual

36 Classification task referred to A subdivision

39 What could be the possible reason(s) for producing

1 This clustering algorithm terminates when mean values

3 Given a rule of the form IF X THEN Y, rule confidence is

5 Find odd man out DBSCAN K-Mean PAM None of

10 The average positive difference between computed and

18 Significant Bottleneck in the Apriori algorithm is Finding

21 KDD represents extraction of data knowledge rules model

23 The most general form of distance is Manhattan Euclidean Mean Minkowski
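
Question 23 rests on the fact that the Minkowski distance of order p reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2. A minimal sketch, with a hypothetical helper name `minkowski` and example points chosen for easy arithmetic:

```python
def minkowski(u, v, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

origin, point = (0, 0), (3, 4)
print(minkowski(origin, point, 1))  # Manhattan: |3| + |4|
print(minkowski(origin, point, 2))  # Euclidean: sqrt(3^2 + 4^2)
```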

24 Which Association Rule would you prefer
a) High support and medium confidence b) High support and low confidence c) Low support and high confidence d) Low support and low confidence
Ans: c
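
The two measures compared in question 24 can be computed directly from a list of transactions. The sketch below uses a small hypothetical basket data set; `support` is the fraction of transactions containing an itemset, and `confidence` of the rule X → Y is support(X ∪ Y) / support(X).

```python
# Hypothetical market-basket transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    """Confidence of the rule x -> y."""
    return support(x | y) / support(x)

print("support(bread, milk)     =", support({"bread", "milk"}))
print("confidence(bread -> milk) =", round(confidence({"bread"}, {"milk"}), 2))
```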

25 In a rule-based classifier, if there is a rule for each combination of attribute values, what do you call that rule set R?
a) Exhaustive b) Inclusive c) Comprehensive d) Mutually exclusive
Ans: a

26 The apriori property means If a set cannot

29 Simple regression assumes a __________ relationship between the input attribute and output attribute.
a) quadratic b) inverse c) linear d) reciprocal
Ans: c

30 To determine association rules from frequent item sets Only

36 Which of the following metrics can be used for evaluating regression models? i) R Squared ii) Adjusted R Squared iii) F Statistics iv) RMSE/MSE/MAE
a) ii and iv b) i and ii c) ii, iii and iv d) i, ii, iii and iv
Ans: d

37 How many coefficients do you need to estimate in a simple linear regression model (One independent variable)?
a) 1 b) 2 c) 3 d) 4
Ans: b

38 In a simple linear regression model (one independent variable), if we change the input variable by 1 unit, how much will the output variable change?
a) by 1 b) no change c) by intercept d) by its slope
Ans: d
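
Questions 37 and 38 can both be checked with a small least-squares sketch. The data below is hypothetical (generated from y = 1 + 2x) and `fit_line` is an illustrative helper, not a library function. The fit estimates exactly two coefficients, and a one-unit change in the input changes the prediction by the slope.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
intercept, slope = fit_line(xs, ys)

def predict(x):
    return intercept + slope * x

# Changing the input by 1 unit changes the output by the slope.
change = predict(4) - predict(3)
print("intercept:", intercept, "slope:", slope, "change per unit:", change)
```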

39 In syntax of linear model lm(formula,data,..), data refers to

Sr. No. Questions a b c d Ans

1 A _________ is a decision support tool that uses a tree-like

4 Decision Trees can be used for Classification Tasks. TRUE FALSE

10 Which of the following statements about Naive Bayes is

14 How the bayesian network can be used to answer any