[This question paper contains 12 printed pages.]
Your Roll No. ...............

Sr. No. of Question Paper : 4192    H
Unique Paper Code         : 2343012005
Name of the Paper         : Data Mining I
Name of the Course        : B.Sc. (Hons.) Computer Science
Semester                  : IV

Duration : 3 Hours                  Maximum Marks : 90

Instructions for Candidates

1. Write your Roll No. on the top immediately on receipt of this question paper.
2. Section A (Question No. 1) is compulsory.
3. Attempt any four questions from Section B (Questions 2 to 7).
4. The use of a simple calculator is allowed.
5. Parts of the question must be answered together.

7. Given a dataset with six records about startup companies, each record has two fields: Number of Clients and Annual Turnover. Assuming that k = 2 and initial cluster centres as the first two records, compute the cluster centres of the resulting clusters until the stopping criterion is met. Use Euclidean distance as the distance metric. Also, compute the SSE (Sum of Squared Error) of each generated cluster. (15)

Number of Clients    Annual Turnover (in Lakhs)
185                  72
170                  56
168                  60
179                  [illegible]
182                  [illegible]
188                  [illegible]
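The k-means procedure asked for in the startup-companies question can be sketched as follows. This is a minimal illustration, not a model answer: the function names `kmeans` and `sse` are hypothetical, and the demonstration uses only the three records whose two fields are both legible in this copy, so its output is not the answer for the full six-record dataset.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain k-means: the first k points are the initial centres,
    Euclidean distance is the metric, and iteration stops when the
    centres no longer change."""
    centres = [tuple(float(v) for v in p) for p in points[:k]]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the mean of its cluster.
        new_centres = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:  # stopping criterion: centres unchanged
            break
        centres = new_centres
    return centres, clusters

def sse(cluster, centre):
    """Sum of squared Euclidean distances from each point to its centre."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, centre)) for p in cluster)

# Illustration only: the three fully legible records (clients, turnover).
records = [(185, 72), (170, 56), (168, 60)]
centres, clusters = kmeans(records, k=2)
cluster_sse = [sse(c, m) for c, m in zip(clusters, centres)]
print(centres, cluster_sse)
```

With the complete table, the same two functions would be called on all six records.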
Section A

1. (a) Differentiate between the unsupervised and supervised evaluation measures used for cluster validity. (3)

(b) What is the anti-monotone property of the support measure in association rule mining? Does the confidence measure follow the anti-monotone property? (3)

(c) Consider a dataset with two class labels, News and Entertainment, and six labeled documents D1-D6. A new document, D7, is to be classified. The similarity values of D7 with D1, D2, D3, D4, D5 and D6 are 0.75, 0.85, 0.66, 0.87, 0.70 and 0.84 respectively. Using the k-Nearest Neighbor classifier, predict the class label that should be assigned to D7 when k = 3. Will the predicted class label change with k = 5? (4)

Document    Class Label
D1          News
D2          Entertainment
D3          Entertainment
D4          News
D5          News
D6          Entertainment

ID     Age            Fever          BD          Outcome
P1     Young          Yes            High        In ICU
P2     Young          No             High        Hospitalized
P3     Elderly        [illegible]    High        In ICU
P4     Middle aged    [illegible]    Moderate    In ICU
P5     Middle aged    No             High        Home Care
P6     Middle aged    Yes            Moderate    In ICU
P7     Elderly        No             Moderate    In ICU
P8     Elderly        No             High        Deceased
P9     Elderly        [illegible]    High        In ICU
P10    Young          No             High        Hospitalized

BD: Breathing Difficulty

(a) Compute the Gini Index of Age, Fever, and BD attributes. Given that you construct a decision tree using the Gini Index as the splitting criterion, which of the three attributes would you choose at the root? Justify your choice. (9)

(b) Compute the Gini Index of ID. Why should it not be used as a splitting attribute for constructing a decision tree? (3)

(c) Given ten objects in the dataset (P1-P10), mention all train and test distributions for performing k-fold cross-validation. Assume the value of k = 5. (3)
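For the Gini Index parts of the COVID-19 question, the weighted Gini of a split can be computed as in the sketch below. The helper names `gini` and `gini_split` are hypothetical. The example covers only the ID and BD columns, because three Fever entries are unreadable in this copy, and the Outcome and BD values are transcribed from the table as best as they can be read.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(attribute_values, labels):
    """Weighted Gini impurity after a multiway split on one attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / n * gini(g) for g in groups.values())

# Outcome column for P1..P10, as read from the table.
outcome = ["In ICU", "Hospitalized", "In ICU", "In ICU", "Home Care",
           "In ICU", "In ICU", "Deceased", "In ICU", "Hospitalized"]
patient_id = [f"P{i}" for i in range(1, 11)]
bd = ["High", "High", "High", "Moderate", "High",
      "Moderate", "Moderate", "High", "High", "High"]

print("Gini before splitting:", gini(outcome))
print("Gini of split on ID:  ", gini_split(patient_id, outcome))
print("Gini of split on BD:  ", gini_split(bd, outcome))
```

Splitting on ID puts every patient in its own pure partition, which is why its weighted Gini comes out as zero even though the attribute cannot generalise to unseen patients.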
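The k-Nearest Neighbor question on document D7 can be checked with a short majority-vote sketch. It assumes that a higher similarity value means a closer neighbour and that ties do not need special handling (none occur for odd k with two classes); the function name `knn_predict` is hypothetical.

```python
from collections import Counter

# Similarity of D7 to each labeled document, as given in the question.
similarity = {"D1": 0.75, "D2": 0.85, "D3": 0.66,
              "D4": 0.87, "D5": 0.70, "D6": 0.84}
label = {"D1": "News", "D2": "Entertainment", "D3": "Entertainment",
         "D4": "News", "D5": "News", "D6": "Entertainment"}

def knn_predict(k):
    """Majority class among the k documents most similar to D7."""
    nearest = sorted(similarity, key=similarity.get, reverse=True)[:k]
    votes = Counter(label[doc] for doc in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(3))
print(knn_predict(5))
```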
(i) List the confusion matrix for "Classifier A" and "Classifier B". Find the accuracy, precision, sensitivity, recall and specificity for each classifier. (8)

(ii) What problem may occur if the provided training dataset of 500 patients had only 15 positive instances and the remaining negative instances? Which performance measure would you choose to evaluate the classifiers in such a scenario? Which is the better classifier between Classifier A and Classifier B in such a scenario? (4)

(b) Consider a categorical attribute Grade with three values {A, B, and C}. Convert this attribute to asymmetric binary attributes. (3)

6. Consider the given COVID-19 dataset of ten patients.

(d) Consider the given dataset, which contains six objects, each with two attributes: Age and Salary. K-means clustering is used to cluster the given objects. Do you see any issue with applying K-means to the given dataset? If yes, then state the issue. Also apply the appropriate preprocessing technique to overcome it. If no, state explicitly that no preprocessing technique is required. (4)

            Age (in years)    Salary (in rupees)
Object 1    40                62000
Object 2    24                48000
Object 3    30                54000
Object 4    35                6?000
Object 5    46                80000
Object 6    [illegible]       66000

(e) Define the curse of dimensionality. The Iris flower dataset comprises 150 data points and four features, namely sepal length, sepal width, petal width, and petal length. Is it high-dimensional data or low-dimensional data? Justify your answer. (4)

(f) Consider a decision tree to classify the health of an individual as Fit or Unfit given below:
[Figure: decision tree. Root test: Age < 30; other internal nodes: Smokes/Drinks?, Workout?, Diet Control?; branches labelled Yes/No; leaf labels: Fit, Unfit.]

(i) Extract all classification rules from the decision tree.

(ii) Classify the following object:
Age = 50, Workout = No, Smokes/Drinks = No, Diet Control = No, Health = ? (4)

(g) Classify the following tasks as "predictive" or "descriptive". Justify your answer. (4)

(i) Foretelling whether an online user will shop on Flipkart for a specific item.

(c) Enumerate all association rules generated from the largest frequent itemset found in each dataset scan. Compute the confidence of each generated rule. Assuming that the minimum confidence threshold is 70%, find all the strong association rules. (6)

5. (a) A medical team develops classification models for predicting the occurrence of a "genetic disorder" using Classifier A and Classifier B. Patients having genetic disorders are considered positive instances. In contrast, negative instances are ones with the absence of genetic disorders. The classifiers were tested on data from 500 patients, giving the following results:

                                                          Actual Label
                                                          Presence of         Absence of
                                                          Genetic Disorder    Genetic Disorder
Classifier A, predicted "presence of genetic disorder"    131                 [illegible]
Classifier A, predicted "absence of genetic disorder"     19                  195
Classifier B, predicted "presence of genetic disorder"    82                  72
Classifier B, predicted "absence of genetic disorder"     68                  278
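The evaluation measures requested for the two classifiers follow directly from the four confusion-matrix cells. The sketch below uses Classifier B only, and it assumes that the garbled last cell of that classifier reads 278, which makes the four cells sum to the stated 500 patients. One cell of Classifier A is unreadable in this copy, so it is left out. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Standard measures from a binary confusion matrix.
    Sensitivity and recall are the same quantity: tp / (tp + fn)."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Classifier B; tn=278 is an assumed reading of the garbled cell.
classifier_b = metrics(tp=82, fp=72, fn=68, tn=278)
for name, value in classifier_b.items():
    print(f"{name}: {value:.4f}")
```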
(iii) What is an outlier? Spot an outlier in the provided dataset. (3)

(b) What is the need for sampling in data mining? What problems arise if the sample size is too small or too large? (3)

4. Consider the following transactional data of a grocery store:

Transaction ID    Items
T1                Boots, Hoodie, [illegible]
T2                Boots, Hoodie
T3                Hoodie, Coat, Cardigan
T4                Cardigan, Coat
T5                Cardigan, Gloves
T6                Hoodie, Coat, Cardigan

(a) What is the maximum number of rules that can be extracted from this data (including rules that have zero support)? (3)

(b) Use the Apriori algorithm on the given transactional dataset and compute the candidate and frequent itemsets for each dataset scan. Assume a support threshold of 33.34%. (6)

(ii) Grouping the customers of a company according to their buying interests.

(iii) Finding a group of genes such that genes in each group have related functionality.

(iv) Using historical data from previous financial statements to project sales, revenue, and expenses for a company.

(h) Given two objects X = (22, 1, 42, 10) and Y = (20, 0, 36, 8), compute the distance between these two objects using the following distance measures:
(i) Euclidean Distance
(ii) Manhattan Distance (4)

Section B

2. (a) Given the following training dataset, compute all class conditional and prior probabilities. Use the Naive Bayes approach to predict the class label (Salary) for the test instance: (12)
Education Level = PG, Career = Management, Years of Experience = 3 to 10

Education Level    Career        Years of Experience    Salary
UG                 Management    Less than 3            Low
UG                 Management    3 to 10                Low
PG                 Management    Less than 3            High
PG                 Service       More than 10           Low
UG                 Service       3 to 10                Low
PG                 Service       3 to 10                High
PG                 Management    More than 10           High
PG                 Service       Less than 3            Low
UG                 Management    More than 10           High
UG                 Service       More than 10           Low

(b) A data mining application uses a particular type of data. Give one application for each of the following types: (3)
(i) Sparse dataset
(ii) Spatio-temporal data
(iii) Graph-based data

3. (a) Consider the following dataset having details about different departments of a company:

ID      Dept. Name                Location       Established On    Size      Annual Budget
DP12    Finance                   Nehru Place    5-01-2020         Large     460
DP19    Marketing                 Nehru Place    8-08-2020         Medium    300
DP21    Human Resource            Hauz Khas      2-01-2020         Medium    240
DP21    Production                               2-02-2020         Medium    290
DP33    Research & Development    Nehru Place    4-07-2021         Small     90
DP39    Information Technology    Hauz Khas      6-09-2020         Medium    210
DP41    Sales                     Nehru Place    9-09-2020         Large     5?0
DP52    Customer                  Hauz Khas      2-10-2020         Medium
DP55    Public Relations          Nehru Place    3-03-2021         Large     900

Annual Budget is in Lakhs.

(i) Identify the type of attributes ID, Dept. Name, Location, Established On, Size, and Annual Budget as nominal, ordinal, interval, or ratio. Give justification for each. (6)

(ii) Suggest a technique for dealing with missing values in the attribute Location. Will the same technique apply to the attribute Annual Budget? Justify. (3)
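For the Naive Bayes question on the Salary training data, the prior and class-conditional probabilities can be tabulated mechanically. The sketch below is one possible way to do it, assuming plain relative-frequency estimates with no Laplace smoothing; the function name `naive_bayes_scores` is hypothetical, and the rows are transcribed from the training table as best as they can be read.

```python
from collections import Counter

# (Education Level, Career, Years of Experience, Salary)
train = [
    ("UG", "Management", "Less than 3", "Low"),
    ("UG", "Management", "3 to 10", "Low"),
    ("PG", "Management", "Less than 3", "High"),
    ("PG", "Service", "More than 10", "Low"),
    ("UG", "Service", "3 to 10", "Low"),
    ("PG", "Service", "3 to 10", "High"),
    ("PG", "Management", "More than 10", "High"),
    ("PG", "Service", "Less than 3", "Low"),
    ("UG", "Management", "More than 10", "High"),
    ("UG", "Service", "More than 10", "Low"),
]

def naive_bayes_scores(instance):
    """Return prior * product of class-conditional probabilities per class."""
    class_counts = Counter(row[-1] for row in train)
    scores = {}
    for cls, count in class_counts.items():
        score = count / len(train)  # prior P(class)
        for i, value in enumerate(instance):
            matches = sum(1 for row in train if row[-1] == cls and row[i] == value)
            score *= matches / count  # P(attribute = value | class)
        scores[cls] = score
    return scores

scores = naive_bayes_scores(("PG", "Management", "3 to 10"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```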
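The two distance measures asked for in Section A for the objects X = (22, 1, 42, 10) and Y = (20, 0, 36, 8) can be verified with a few lines. This is only a checking aid, and the variable names are illustrative.

```python
import math

x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

# Euclidean distance: square root of the summed squared differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Manhattan distance: sum of the absolute differences.
manhattan = sum(abs(a - b) for a, b in zip(x, y))

print(round(euclidean, 4), manhattan)
```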