Dsa - DK Question Paper
ANNA UNIVERSITY (UNIVERSITY DEPARTMENTS)
Department of Information Technology
Semester VI
IT5602 Data Science and Analytics
Assessment 1
(Regulation 2019)
Max. Marks: 50
7(a) A dataset has three features, X1, X2, and X3, with the following        13 3 3
covariance matrix:
        [ 2  1  0 ]
    C = [ 1  2  1 ]
        [ 0  1  2 ]
(i) Compute the eigenvalues and eigenvectors.
(ii) Identify the top 2 principal components.
(iii) Reduce the dataset from 3D to 2D using PCA.
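A hand computation for 7(a) can be cross-checked with the sketch below. It assumes the covariance matrix reads C = [[2, 1, 0], [1, 2, 1], [0, 1, 2]] (the symmetric reading of the printed entries). It uses power iteration with deflation, chosen here only to keep the example dependency-free; the paper does not prescribe a method. The question supplies no raw samples, so `project` takes a hypothetical sample and mean.

```python
import math

# Assumed reading of the covariance matrix (symmetric, 3 x 3).
C = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]


def matvec(A, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def power_iteration(A, start, iterations=200):
    """Return (eigenvalue, unit eigenvector) of the dominant eigenpair of A."""
    v = normalize(start)
    for _ in range(iterations):
        v = normalize(matvec(A, v))
    # For a unit vector, the Rayleigh quotient v^T A v estimates the eigenvalue.
    eigenvalue = sum(x * y for x, y in zip(v, matvec(A, v)))
    return eigenvalue, v


def deflate(A, eigenvalue, v):
    """Remove a found eigenpair from A: A - lambda * v v^T."""
    n = len(A)
    return [[A[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
            for i in range(n)]


start = [1.0, 0.5, 0.25]  # arbitrary starting vector, not orthogonal to PC1 or PC2
lam1, pc1 = power_iteration(C, start)
lam2, pc2 = power_iteration(deflate(C, lam1, pc1), start)


def project(x, mean):
    """Map a 3D sample to 2D coordinates along PC1 and PC2."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(c * p for c, p in zip(centered, pc)) for pc in (pc1, pc2)]


print(round(lam1, 4), round(lam2, 4))
print([round(x, 4) for x in pc1], [round(x, 4) for x in pc2])
```

For this matrix the characteristic polynomial factors as (2 - λ)((2 - λ)² - 2), so the eigenvalues are 2 + √2, 2, and 2 - √2. The two largest account for roughly 90% of the total variance (trace 6). Eigenvectors are defined only up to sign.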
OR
7 (b) A binary classification dataset has two classes:                      13 3 3
Class 1: X1 = {(1,2), (2,3), (3,3)}
Class 2: X2 = {(6,5), (7,8), (8,8)}
(i) Compute the class means m1 and m2.
(ii) Compute the scatter matrices: within-class scatter Sw and
between-class scatter Sb.
(iii) Compute the LDA projection vector.
(iv) Project the dataset onto the LDA axis.
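The steps of 7(b) can be sketched in exact arithmetic as follows. The sketch assumes the two classes are {(1,2), (2,3), (3,3)} and {(6,5), (7,8), (8,8)}, and uses the common two-class forms Sw = S1 + S2, Sb = (m2 - m1)(m2 - m1)^T, and w = Sw^-1 (m2 - m1). Other textbooks scale Sb by class size or normalize w; that changes only the scale of the projection, not the axis.

```python
from fractions import Fraction as F

class1 = [(1, 2), (2, 3), (3, 3)]
class2 = [(6, 5), (7, 8), (8, 8)]


def mean(points):
    """Component-wise mean of 2D points, kept as exact fractions."""
    n = len(points)
    return (F(sum(p[0] for p in points), n), F(sum(p[1] for p in points), n))


def scatter(points, m):
    """Sum of outer products (x - m)(x - m)^T as a 2 x 2 nested list."""
    s = [[F(0), F(0)], [F(0), F(0)]]
    for p in points:
        d = (p[0] - m[0], p[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


m1, m2 = mean(class1), mean(class2)
s1, s2 = scatter(class1, m1), scatter(class2, m2)

# Within-class scatter: sum of the per-class scatter matrices.
Sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]

# Between-class scatter (two-class form): outer product of the mean difference.
diff = (m2[0] - m1[0], m2[1] - m1[1])
Sb = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

# Projection vector w = Sw^-1 (m2 - m1), using the closed-form 2 x 2 inverse.
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
w = ((Sw[1][1] * diff[0] - Sw[0][1] * diff[1]) / det,
     (-Sw[1][0] * diff[0] + Sw[0][0] * diff[1]) / det)


def project(p):
    """Scalar coordinate of a point along the (unnormalized) LDA axis."""
    return w[0] * p[0] + w[1] * p[1]


y1 = [project(p) for p in class1]
y2 = [project(p) for p in class2]
print("m1 =", m1, "m2 =", m2)
print("Sw =", Sw)
print("w  =", w)
print("projections:", y1, y2)
```

With these definitions the projected values of the two classes do not overlap, which is what the LDA axis is meant to achieve.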
PART - C (1 x 14 = 14 Marks)
(Q.No.8 is compulsory)
Q. No Questions Marks CO
8. A car dealership wants to build a regression model to predict the        14 1,2,
selling price of used cars based on historical data. They have
collected the following features:
- Age of the car (years)
- Mileage (miles driven)
- Engine size (liters)
- Brand popularity score (0-10 scale)
The dataset has 5000 records, and after training a Multiple Linear
Regression model, the dealership finds that the R² score is 0.55,
which is lower than expected.
You are hired as a machine learning expert to analyze and improve
the model.
(b) The model has an R² score of 0.55. What does this mean? Is it
good enough?
(c) What additional features might improve the model's accuracy?
Why?
(d) The dealership now adds more features, but the R² score on
training data rises to 0.90, while the test R² remains at 0.55. What
problem is occurring, and how can it be fixed?
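The R² values quoted in (b) and (d) follow the usual definition R² = 1 - SS_res / SS_tot. A minimal sketch, using made-up prices and predictions (not the dealership's data) and a hypothetical helper `generalization_gap`:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def generalization_gap(train_r2, test_r2):
    """Difference between training and test R²; a large gap hints at overfitting."""
    return train_r2 - test_r2


# Hypothetical selling prices (in thousands) and model predictions.
prices = [10.0, 12.0, 14.0, 16.0]
predictions = [11.0, 11.0, 15.0, 15.0]
mean_only = [13.0, 13.0, 13.0, 13.0]  # baseline that always predicts the mean

print(r2_score(prices, predictions))   # fraction of variance explained
print(r2_score(prices, mean_only))     # the mean-only baseline scores 0
print(generalization_gap(0.90, 0.55))  # figures taken from part (d)
```

Read this way, an R² of 0.55 says the model explains about 55% of the variance in price relative to always predicting the mean.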
Roll No.
Department of Information Technology
Semester VI (Regulation 2019)
IT5602 Data Science and Analytics
Assessment 1
Compute the value function for each state using iterative policy evaluation for 3 iterations, initialized at V(s) = 0 for all
states.
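The MDP for this question is not reproduced above, so the sketch below runs on a purely hypothetical three-state example. The state names, transition probabilities, rewards, and discount factor 0.9 are all assumptions. It shows synchronous iterative policy evaluation, where each sweep applies V(s) <- sum over outcomes of p * (r + gamma * V(s')) using the values from the previous sweep.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical dynamics under a fixed policy:
# state -> list of (probability, next_state, reward).
TRANSITIONS = {
    "s0": [(1.0, "s1", -1.0)],
    "s1": [(0.5, "s0", -1.0), (0.5, "s2", 10.0)],
    "s2": [],  # terminal state: no outgoing transitions, value stays 0
}


def policy_evaluation(transitions, gamma, iterations):
    """Run synchronous Bellman expectation backups, starting from V(s) = 0."""
    values = {s: 0.0 for s in transitions}
    history = []
    for _ in range(iterations):
        new_values = {}
        for s, outcomes in transitions.items():
            # Each backup reads only the previous sweep's values.
            new_values[s] = sum(
                (p * (r + gamma * values[nxt]) for p, nxt, r in outcomes), 0.0
            )
        values = new_values
        history.append(dict(values))
    return history


history = policy_evaluation(TRANSITIONS, GAMMA, 3)
for k, v in enumerate(history, start=1):
    print(k, {s: round(x, 4) for s, x in v.items()})
```

An in-place (asynchronous) variant that overwrites values during the sweep is also common and would give different intermediate numbers after 3 iterations.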
OR
6.(b) Consider the following scenario:
You have a dataset of customer profiles including age (numerical), gender (categorical), and browser used (categorical). You
try to use K-Means, but the results make little sense.
I. Explain why K-Means is not suitable for this type of data.
II. Suggest appropriate alternatives.
III. How can you preprocess this data to make it more suitable for K-Means (if necessary)?
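One common preprocessing route for part III is to min-max scale the numerical column and one-hot encode the categorical ones, so that every feature lies in [0, 1] before Euclidean distances are computed. A minimal sketch with invented customer records; the helper names are hypothetical, and this is one option rather than the expected answer:

```python
# Invented sample records, for illustration only.
customers = [
    {"age": 22, "gender": "female", "browser": "Chrome"},
    {"age": 35, "gender": "male", "browser": "Firefox"},
    {"age": 58, "gender": "female", "browser": "Safari"},
    {"age": 41, "gender": "male", "browser": "Chrome"},
]


def min_max_scale(values):
    """Rescale numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def one_hot(value, categories):
    """Encode a category as a 0/1 indicator vector over the known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def encode(rows):
    """Build one numeric feature vector per customer."""
    ages = min_max_scale([r["age"] for r in rows])
    genders = sorted({r["gender"] for r in rows})
    browsers = sorted({r["browser"] for r in rows})
    return [
        [age] + one_hot(r["gender"], genders) + one_hot(r["browser"], browsers)
        for age, r in zip(ages, rows)
    ]


vectors = encode(customers)
for v in vectors:
    print([round(x, 3) for x in v])
```

Even after encoding, distances between indicator vectors are only a rough proxy for categorical similarity, which is why methods designed for mixed data (for example K-Prototypes or clustering on a Gower distance) are often suggested instead.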
7.(a) In a company, a manager suggests using unsupervised clustering to build a customer churn prediction model, because
the dataset has no labels.
Critically evaluate this suggestion. What are the risks of using clustering for a classification problem? Propose a better
alternative if labels are unavailable.
OR
7.(b) Consider a scenario where you're building a recommendation engine and need to evaluate different algorithms. A
colleague suggests using a simple train/test split (80/20 hold-out) instead of K-Fold for faster experimentation.
Evaluate the pros and cons of using hold-out validation versus K-Fold in this scenario. When is each approach preferable?
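The structural difference between the two schemes can be sketched with index splits alone. The function names below are hypothetical, and shuffling is omitted for brevity; in practice the indices are usually shuffled or, for time-ordered interaction data, split chronologically.

```python
def holdout_split(n, test_fraction=0.2):
    """Single train/test split: one model fit, one performance estimate."""
    cut = int(round(n * (1 - test_fraction)))
    indices = list(range(n))
    return indices[:cut], indices[cut:]


def kfold_splits(n, k=5):
    """K train/test splits: k model fits, every sample tested exactly once."""
    indices = list(range(n))
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


train, test = holdout_split(10)
print("hold-out:", train, test)
for i, (tr, te) in enumerate(kfold_splits(10, k=5), start=1):
    print("fold", i, "test indices:", te)
```

Hold-out costs one training run and bases its estimate on a single 20% slice. K-Fold costs k training runs and averages over k slices, which generally gives a less noisy estimate.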
Part-C (1x14 = 14)
8. You are a data engineer at a large e-commerce company. Your team is planning to store and process petabytes of user
clickstream data. The data will be used for analytics, recommendation engines, and fraud detection. The CTO suggests using
Hadoop Distributed File System (HDFS) to store this data because of its scalability and fault tolerance.
However, your team is concerned about the following:
- The average file size is only 1 MB, but there are millions of files generated daily.
- You need fast access to small files for real-time analytics.
- Storage nodes (DataNodes) are expected to fail occasionally due to hardware constraints.
- The team is considering whether to increase or reduce the default block size (128 MB).
- There is also a plan to store machine learning models and image data on the same cluster.
Critically evaluate the suitability of HDFS for this workload.
Identify and explain at least three challenges this scenario poses for HDFS, and propose practical solutions or
workarounds for each.
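The small-files concern can be made concrete with a back-of-the-envelope estimate of NameNode metadata. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block) and takes 5 million files per day as a hypothetical stand-in for "millions of files". Both figures are assumptions, not values from the scenario.

```python
import math

BYTES_PER_OBJECT = 150      # assumed rule-of-thumb heap cost per file or block entry
BLOCK_SIZE_MB = 128         # default block size named in the scenario
FILES_PER_DAY = 5_000_000   # hypothetical stand-in for "millions of files"
FILE_SIZE_MB = 1            # average file size from the scenario


def namenode_objects(num_files, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count namespace objects: one per file plus one per block of that file."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return num_files * (1 + blocks_per_file)


def heap_gb(num_objects):
    """Approximate NameNode heap used by the given number of objects."""
    return num_objects * BYTES_PER_OBJECT / 1e9


# Case 1: store every 1 MB file individually.
small = namenode_objects(FILES_PER_DAY, FILE_SIZE_MB)

# Case 2: pack the same data into block-sized (128 MB) container files.
packed_files = math.ceil(FILES_PER_DAY * FILE_SIZE_MB / BLOCK_SIZE_MB)
packed = namenode_objects(packed_files, BLOCK_SIZE_MB)

print("unpacked:", small, "objects,", round(heap_gb(small), 3), "GB per day")
print("packed:  ", packed, "objects,", round(heap_gb(packed), 4), "GB per day")
```

Under these assumptions the unpacked layout adds on the order of a gigabyte of NameNode heap per day, while packing cuts the object count by roughly two orders of magnitude. Note that a 1 MB file occupies only about 1 MB of DataNode disk per replica regardless of block size, so the pressure falls on metadata and per-file access overhead rather than on wasted disk space.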