Dsa - DK Question Paper

The document outlines an assessment for a Data Science and Analytics course at Anna University, focusing on various analytical techniques and machine learning applications. It includes questions on statistical testing, data modeling, and machine learning concepts, along with practical scenarios for students to analyze. The assessment is divided into three parts, covering theoretical and practical aspects of data analytics and machine learning.

Uploaded by

mytreyan197

Roll No.

ANNA UNIVERSITY (UNIVERSITY DEPARTMENTS)
Department of Information Technology
Semester VI (Regulation 2019)
IT5602 Data Science and Analytics
Assessment 1

Time: 1.5 hrs Max. Marks: 50

Date: 26th February 2025

CO 1 Identify the real world business problems and model with analytical solutions.
CO 2 Solve analytical problem with relevant mathematics background knowledge.
CO 3 Convert any real world decision making problem to hypothesis and apply suitable statistical testing.
CO 4 Write and demonstrate simple applications involving analytics using Hadoop and MapReduce.
CO 5 Use open source frameworks for modeling and storing data.
CO 6 Perform data analytics and visualization using Python.

BL - Bloom's Taxonomy Levels
(L1 - Remembering, L2 - Understanding, L3 - Applying, L4 - Analyzing, L5 - Evaluating, L6 - Creating)
PART - A (5 x 2 = 10 Marks)
(Answer all Questions)
1. How does machine learning contribute to data science? (Marks: 2, CO: 1, BL: 3)
2. With the rapid increase in data generation, how can organizations ensure data quality and accuracy? (Marks: 2, CO: 3, BL: 4)
3. Why do machine learning models often assume data follows a normal distribution? (Marks: 2, CO: 3, BL: 3)
4. How does a skewed distribution impact machine learning models? (Marks: 2, CO: 2, BL: 4)
5. Why is univariate analysis important before building a machine learning model? (Marks: 2, CO: 2, BL: 2)
PART - B (2 x 13 = 26 Marks)
6 (a) (i) A machine learning engineer is developing a model to predict customer satisfaction (Z) based on: (Marks: 07, CO: 3)
- Response time (X) (how quickly customer service replies, in minutes)
- Resolution time (Y) (how long it takes to resolve the issue, in hours)
They calculate the correlation matrix:
    [  1     0.78  -0.85 ]
R = [  0.78  1     -0.92 ]
    [ -0.85 -0.92   1    ]
where:
r(X,Z) = -0.85 (Response time vs. Satisfaction)
r(Y,Z) = -0.92 (Resolution time vs. Satisfaction)
r(X,Y) = 0.78 (Response time vs. Resolution time)
(A) Which factor affects customer satisfaction more?
(B) How should the company improve satisfaction?
6 (a) (ii)
A self-driving car uses radar sensors to detect pedestrians. The prior probability of a pedestrian being present at a given moment is 0.2. (Marks: 06, CO: 3, BL: 5)
The sensor has the following characteristics:
- True Positive Rate: The sensor correctly detects a pedestrian 90% of the time (P(D+ | P) = 0.9).
- False Positive Rate: The sensor incorrectly detects a pedestrian 10% of the time when none is present (P(D+ | ¬P) = 0.1).
If the sensor detects a pedestrian, what is the probability that a pedestrian is actually present?
Comment on the result.
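
The sensor question is a direct application of Bayes' theorem, P(P | D+) = P(D+ | P) P(P) / P(D+). A minimal sketch of the arithmetic follows; the function name `posterior` and its parameter names are illustrative choices, not part of the question paper.

```python
def posterior(prior: float, tpr: float, fpr: float) -> float:
    """Return P(pedestrian | detection) using Bayes' theorem."""
    # Total probability of a detection: true alarms plus false alarms.
    evidence = tpr * prior + fpr * (1.0 - prior)
    return tpr * prior / evidence


# Values taken from the question: prior 0.2, TPR 0.9, FPR 0.1.
p = posterior(prior=0.2, tpr=0.9, fpr=0.1)
print(round(p, 4))  # 0.6923
```

The posterior is 9/13, noticeably lower than the 90% detection rate, because pedestrians are present only 20% of the time and false alarms from the remaining 80% dilute the evidence.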
OR
6 (b) Using the below data, apply NBC to identify the species of an entity with the following attributes. (Marks: 13, CO: 3, BL: 5)
X = {Color = Green, Legs = 2, Height = Tall, Smelly = No}
Sl. No.  Color  Legs  Height  Smelly  Species
1        White  3     Short   Yes     M
2        Green  2     Tall    No      M
3        Green  3     Short   Yes     M
4        White  3     Short   Yes     I
5        Green  2     Short   No
6        White  2     Tall    No
7        White  2     Tall    No
8        White  2     Short   Yes     H

7 (a) A dataset has three features: X1, X2, X3 with the following covariance matrix: (Marks: 13, CO: 3, BL: 3)

    [ 2 1 0 ]
C = [ 1 2 1 ]
    [ 0 1 2 ]

(i) Compute the eigenvalues and eigenvectors.
(ii) Identify the top 2 principal components.
(iii) Reduce the dataset from 3D to 2D using PCA.
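
As a rough check on the PCA steps, the sketch below assumes the covariance matrix is the symmetric matrix with rows (2, 1, 0), (1, 2, 1), (0, 1, 2); the scan is incomplete, and this is the only completion consistent with symmetry. For that matrix the eigenvalues have the closed form 2 + √2, 2 and 2 - √2. The code verifies the eigenpairs rather than deriving them, and the helper names `matvec` and `project` are illustrative.

```python
import math

# Assumed covariance matrix (symmetric completion of the scanned entries).
C = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 1.0],
     [0.0, 1.0, 2.0]]

r2 = math.sqrt(2.0)
# Closed-form unit eigenvectors, ordered by decreasing eigenvalue.
eigenpairs = [
    (2.0 + r2, [0.5, r2 / 2.0, 0.5]),
    (2.0, [1.0 / r2, 0.0, -1.0 / r2]),
    (2.0 - r2, [0.5, -r2 / 2.0, 0.5]),
]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


# Check that C v = lambda v holds for every pair.
for lam, v in eigenpairs:
    assert all(abs(cv - lam * vi) < 1e-12 for cv, vi in zip(matvec(C, v), v))

# The top two principal components are the first two eigenvectors.
W = [v for _, v in eigenpairs[:2]]
explained = sum(lam for lam, _ in eigenpairs[:2]) / sum(lam for lam, _ in eigenpairs)


def project(x):
    """Map a mean-centred 3-D sample to its 2-D principal-component scores."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for w in W]


print(round(explained, 4))  # 0.9024
```

Under this assumption the first two components retain (4 + √2) / 6, about 90%, of the total variance, and `project` gives the 2-D coordinates of any centred sample.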
OR
7 (b) A binary classification dataset has two classes: (Marks: 13, CO: 3, BL: 3)
Class 1: X1 = {(1,2), (2,3), (3,3)}
Class 2: X2 = {(6,5), (7,8), (8,8)}
(i) Compute the class means m1 and m2.
(ii) Compute the scatter matrices: within-class scatter Sw and between-class scatter Sb.
(iii) Compute the LDA projection vector.
(iv) Project the dataset onto the LDA axis.
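
The four LDA steps can be sketched in plain Python for these six points. This assumes the common two-class convention Sb = (m2 - m1)(m2 - m1)^T without sample-count weights and takes the projection vector as w = Sw^-1 (m2 - m1); a course text may scale or sign these differently. The helper names are illustrative.

```python
class1 = [(1, 2), (2, 3), (3, 3)]
class2 = [(6, 5), (7, 8), (8, 8)]


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, m):
    """Sum of outer products (x - m)(x - m)^T as a 2x2 nested list."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in points:
        d = (x[0] - m[0], x[1] - m[1])
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


# (i) class means
m1, m2 = mean(class1), mean(class2)

# (ii) within-class and between-class scatter
s1, s2 = scatter(class1, m1), scatter(class2, m2)
Sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
diff = (m2[0] - m1[0], m2[1] - m1[1])
Sb = [[diff[i] * diff[j] for j in range(2)] for i in range(2)]

# (iii) w = Sw^-1 (m2 - m1), using the explicit 2x2 inverse
det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
w = ((Sw[1][1] * diff[0] - Sw[0][1] * diff[1]) / det,
     (-Sw[1][0] * diff[0] + Sw[0][0] * diff[1]) / det)


# (iv) project every point onto the LDA axis
def project(x):
    return w[0] * x[0] + w[1] * x[1]


y1 = [project(x) for x in class1]
y2 = [project(x) for x in class2]
print(w, y1, y2)
```

With these conventions w works out to (1.5, -0.25), and the projected classes do not overlap: every Class 1 score lies below every Class 2 score.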
PART - C (1 x 14 = 14 Marks)
(Q.No.8 is compulsory)
8. A car dealership wants to build a regression model to predict the selling price of used cars based on historical data. (Marks: 14, CO: 1,2,)
They have collected the following features:
- Age of the car (years)
- Mileage (miles driven)
- Engine size (liters)
- Brand popularity score (0-10 scale)
The dataset has 5000 records, and after training a Multiple Linear Regression model, the dealership finds the R² score is 0.55, which is lower than expected.
You are hired as a machine learning expert to analyze and improve the model.

After training, the dealership gets the following regression equation:
Price = 30,000 - 2,500 × Age - 0.05 × Mileage + 4,000 × Engine Size + 1,500 × Brand Popularity

(a) Interpret the regression coefficients. What do they mean in real world terms?
(b) The model has an R² score of 0.55. What does this mean? Is it good enough?
(c) What additional features might improve the model's accuracy? Why?
(d) The dealership now adds more features, but the R² score on training data rises to 0.90, while the test R² remains at 0.55. What problem is occurring, and how can it be fixed?
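
One way to read the coefficients in part (a) is to evaluate the fitted equation and change one input at a time. The sketch below does this; the example car (5 years old, 60,000 miles, 2.0 L engine, popularity 7) is hypothetical and not taken from the question.

```python
def predict_price(age_years, mileage_miles, engine_litres, brand_score):
    """Evaluate the regression equation given in the question."""
    return (30_000
            - 2_500 * age_years
            - 0.05 * mileage_miles
            + 4_000 * engine_litres
            + 1_500 * brand_score)


# Hypothetical car used only to illustrate the coefficients.
base = predict_price(5, 60_000, 2.0, 7)
one_year_older = predict_price(6, 60_000, 2.0, 7)
extra_10k_miles = predict_price(5, 70_000, 2.0, 7)

print(base)
print(one_year_older - base)   # effect of one extra year, other inputs fixed
print(extra_10k_miles - base)  # effect of 10,000 extra miles, other inputs fixed
```

Each difference isolates a single coefficient: holding the other features fixed, one more year lowers the predicted price by 2,500 and each additional mile lowers it by 0.05.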
Roll No.
Department of Information Technology
Semester VI (Regulation 2019)
IT5602 Data Science and Analytics
Assessment 1

Time: 1.5 hrs Max. Marks: 50


Date: 16th April 2025
Part-A (5x2 = 10)
1. What happens to bias and variance when you use a very complex model (e.g., a deep neural network) on a small dataset?
2. What are the risks of removing all outliers from a dataset?
3. Why does HDFS use large block sizes (e.g., 128MB) instead of small ones like traditional file systems?
4. What are the risks of using a flexible schema in MongoDB?
5. Why is HiveQL not suitable for real-time analytics?
Part-B (13x2 = 26)
6.(a) Consider a scenario where you have a 2x2 Gridworld with 4 states: S1, S2, S3, S4.
You start in any state, and at each step, you can move Up, Down, Left, or Right, but if the move would take you off the grid,
you stay in the same state.
All transitions are deterministic.
The reward for all transitions is -1, and the discount factor γ = 0.9.
You are given a uniform random policy: each action has equal probability (0.25).
Using the Bellman Expectation Equation:
V(s) = Σ_a π(a|s) Σ_s' P(s'|s, a) [R(s, a, s') + γ V(s')]

Compute the value function for each state using iterative policy evaluation for 3 iterations, initialized at V(s) = 0 for all
states.
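
Iterative policy evaluation for this Gridworld can be sketched as below. The question does not say how S1 to S4 are arranged, so the layout (S1 and S2 on the top row, S3 and S4 below) is an assumption, as is the use of synchronous updates in which every state is backed up from the previous iteration's values.

```python
GAMMA = 0.9
REWARD = -1.0

# Assumed layout (not given in the question): S1 S2 on top, S3 S4 below.
POS = {"S1": (0, 0), "S2": (0, 1), "S3": (1, 0), "S4": (1, 1)}
STATE_AT = {rc: s for s, rc in POS.items()}
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}


def step(state, action):
    """Deterministic transition; moves that leave the grid keep the state."""
    r, c = POS[state]
    dr, dc = MOVES[action]
    return STATE_AT.get((r + dr, c + dc), state)


def evaluate(iterations):
    """Return the value function after each synchronous sweep."""
    v = {s: 0.0 for s in POS}
    history = []
    for _ in range(iterations):
        # Bellman expectation backup under the uniform random policy.
        v = {s: sum(0.25 * (REWARD + GAMMA * v[step(s, a)]) for a in MOVES)
             for s in POS}
        history.append(dict(v))
    return history


for k, values in enumerate(evaluate(3), start=1):
    print(k, {s: round(x, 4) for s, x in values.items()})
```

Because every state has two moves that stay put and two that reach a neighbour, the four states remain equal at each sweep: -1, -1.9 and -2.71 after iterations 1, 2 and 3.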
OR
6.(b) Consider the following Scenario:
You have a dataset of customer profiles including age (numerical), gender (categorical), and browser used (categorical). You
try to use K-Means, but the results make little sense.
I. Explain why K-Means is not suitable for this type of data.
II. Suggest appropriate alternatives.
III. How can you preprocess this data to make it more suitable for K-Means (if necessary)?
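
For part III, one common preprocessing route is to min-max scale the numeric column and one-hot encode the categorical ones so that every feature is numeric and on a comparable range. The sketch below uses three made-up customer records, since the question supplies no data; the `encode` helper is illustrative.

```python
# Hypothetical customer records; the question supplies no actual data.
customers = [
    {"age": 23, "gender": "F", "browser": "Chrome"},
    {"age": 41, "gender": "M", "browser": "Firefox"},
    {"age": 35, "gender": "F", "browser": "Safari"},
]


def encode(records, numeric, categorical):
    """Min-max scale numeric fields and one-hot encode categorical fields."""
    lo = {k: min(r[k] for r in records) for k in numeric}
    hi = {k: max(r[k] for r in records) for k in numeric}
    levels = {k: sorted({r[k] for r in records}) for k in categorical}
    rows = []
    for r in records:
        # Scale each numeric value into [0, 1]; constant columns map to 0.
        row = [(r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
               for k in numeric]
        # One indicator column per observed category level.
        for k in categorical:
            row += [1.0 if r[k] == level else 0.0 for level in levels[k]]
        rows.append(row)
    return rows


X = encode(customers, numeric=["age"], categorical=["gender", "browser"])
for row in X:
    print(row)
```

Even after encoding, Euclidean distance on indicator columns is only a rough proxy for categorical similarity, which is why methods designed for mixed data, such as K-Prototypes, are often suggested as alternatives.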
7.(a) In a company, a manager suggests using unsupervised clustering to build a customer churn prediction model, because
the dataset has no labels.
Critically evaluate this suggestion. What are the risks of using clustering for a classification problem? Propose a better
alternative if labels are unavailable.
OR
7.(b) Consider a scenario where you're building a recommendation engine and need to evaluate different algorithms. A
colleague suggests using a simple train/test split (80/20 hold-out) instead of K-Fold for faster experimentation.
Evaluate the pros and cons of using hold-out validation versus K-Fold in this scenario. When is each approach preferable?
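
The cost difference between the two schemes is easiest to see from how the indices are split. The sketch below is a minimal illustration with hypothetical helper names; it splits interaction indices at random, which is itself an assumption, since recommendation data is often split by user or by time instead.

```python
import random


def holdout_split(n, test_fraction=0.2, seed=0):
    """Single shuffled split: one model fit, one evaluation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = int(round(n * (1 - test_fraction)))
    return idx[:cut], idx[cut:]


def kfold_splits(n, k=5, seed=0):
    """Yield k (train, test) pairs; every index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


train, test = holdout_split(100)
print(len(train), len(test))

for fold_train, fold_test in kfold_splits(100, k=5):
    print(len(fold_train), len(fold_test))
```

Hold-out trains once on 80% of the data, so it is roughly k times cheaper, while K-Fold trains k models and averages k scores, which gives a less noisy estimate when data is limited.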
Part-C (1x14 = 14)
8. You are a data engineer at a large e-commerce company. Your team is planning to store and process petabytes of user
clickstream data. The data will be used for analytics, recommendation engines, and fraud detection. The CTO suggests using
Hadoop Distributed File System (HDFS) to store this data because of its scalability and fault tolerance.
However, your team is concerned about the following:
- The average file size is only 1MB, but there are millions of files generated daily.
- You need fast access to small files for real-time analytics.
- Storage nodes (DataNodes) are expected to fail occasionally due to hardware constraints.
- The team is considering whether to increase or reduce the default block size (128MB).
- There's also a plan to store machine learning models and image data on the same cluster.
Critically evaluate the suitability of HDFS for this workload.
Identify and explain at least three challenges this scenario poses for HDFS, and propose practical solutions or
workarounds for each.
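
The small-files concern can be made concrete with a rough count of the metadata objects the NameNode has to track. The sketch below assumes a commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object and a hypothetical 5 million files per day; neither figure comes from the question.

```python
import math

BLOCK_SIZE_MB = 128
BYTES_PER_OBJECT = 150  # assumed rule-of-thumb heap cost per file or block object


def namenode_objects(num_files, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count one object per file plus one per block the file occupies."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return num_files * (1 + blocks_per_file)


files_per_day = 5_000_000  # hypothetical volume for "millions of files daily"

# Stored as-is: every 1 MB file needs its own file entry and block entry.
small = namenode_objects(files_per_day, 1)

# Packed into 128 MB container files before loading.
packed_files = math.ceil(files_per_day * 1 / BLOCK_SIZE_MB)
packed = namenode_objects(packed_files, BLOCK_SIZE_MB)

print(small, small * BYTES_PER_OBJECT / 1e9, "GB of heap per day")
print(packed, packed * BYTES_PER_OBJECT / 1e6, "MB of heap per day")
```

Under these assumptions, packing small files into block-sized containers cuts the metadata load by roughly two orders of magnitude, which is the motivation behind workarounds that consolidate files before or after ingestion.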