Assignment 2
1. Explain the role of data management in machine learning. Why is it important
before applying ML algorithms?
2. Differentiate between data acquisition and data preprocessing with suitable
examples.
3. What are structured, semi-structured, and unstructured data? Give two
examples of each.
4. Describe different methods of data acquisition in real-world ML projects.
5. A company collects daily sales records, customer reviews, and website
clickstreams. Classify these into structured, semi-structured, and unstructured
data.
6. Suppose you are building a wind forecasting ML model. List at least five
sources of data acquisition and discuss their challenges.
7. Explain the difference between data governance and data management in ML.
8. What is data representation? Why is it crucial for ML algorithms?
9. Differentiate between sparse and dense data representations with an example.
10. Represent the categorical variable "Color = {Red, Blue, Green}" using one-hot encoding and label encoding.
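Both encodings can be sketched in plain Python (assuming ids are assigned in the listed order Red, Blue, Green; library encoders such as scikit-learn's LabelEncoder sort categories alphabetically, so their ids come out differently):

```python
categories = ["Red", "Blue", "Green"]

# Label encoding: each category gets one integer id.
label_map = {c: i for i, c in enumerate(categories)}  # {'Red': 0, 'Blue': 1, 'Green': 2}

# One-hot encoding: each category becomes a binary indicator vector
# with a single 1 at its label position.
one_hot = {c: [1 if i == label_map[c] else 0 for i in range(len(categories))]
           for c in categories}

print(one_hot["Green"])  # [0, 0, 1]
```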
11. Explain embeddings in ML. How do they differ from one-hot encoding?
12. A dataset contains the following categorical variable: Fruit = {Apple, Banana, Mango, Apple, Mango}. Represent it using frequency encoding.
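The frequency encoding can be checked with a few lines; both raw counts and relative frequencies are common conventions, so both are shown:

```python
from collections import Counter

fruits = ["Apple", "Banana", "Mango", "Apple", "Mango"]
counts = Counter(fruits)  # Apple: 2, Banana: 1, Mango: 2

# Frequency encoding: replace each category by its count (or its share of the data).
freq_encoded = [counts[f] for f in fruits]                 # [2, 1, 2, 2, 2]
rel_freq_encoded = [counts[f] / len(fruits) for f in fruits]  # [0.4, 0.2, 0.4, 0.4, 0.4]
```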
13. What is data preprocessing, and why is it necessary in ML?
14. Differentiate between normalization and standardization. Give their formulas.
17. A dataset has Age values: [15, 20, 25, 40, 50]. Perform Min-Max normalization to scale them to [0, 1].
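A small helper to verify the Min-Max arithmetic (min_max is a hypothetical name introduced here); the same function with new_lo=-1, new_hi=1 covers the later [–1, 1] variant:

```python
def min_max(xs, new_lo=0.0, new_hi=1.0):
    """Min-Max scaling: x' = new_lo + (x - min) * (new_hi - new_lo) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [new_lo + (x - lo) * (new_hi - new_lo) / (hi - lo) for x in xs]

ages = [15, 20, 25, 40, 50]
scaled = min_max(ages)  # endpoints map to 0.0 and 1.0; 20 -> 5/35, 25 -> 10/35, 40 -> 25/35
```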
18. Perform Z-score standardization for the dataset: [10, 20, 30, 40].
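The Z-score arithmetic, assuming the population standard deviation (statistics.stdev, the sample version, gives slightly different z values):

```python
import statistics

data = [10, 20, 30, 40]
mu = statistics.mean(data)       # 25
sigma = statistics.pstdev(data)  # population std = sqrt(125) ~ 11.18

# Z-score: z = (x - mu) / sigma; the result has mean 0 and (population) std 1.
z = [(x - mu) / sigma for x in data]
```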
19. A dataset has missing values: [10, NaN, 30, NaN, 50]. Perform mean, median, and forward fill imputation.
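The three imputation strategies sketched in plain Python (pandas offers the same via fillna and ffill); note that for this data the mean and median of the observed values both equal 30, so those two fills coincide:

```python
import math
import statistics

values = [10, float("nan"), 30, float("nan"), 50]
observed = [v for v in values if not math.isnan(v)]  # [10, 30, 50]

mean_val = statistics.mean(observed)      # 30
median_val = statistics.median(observed)  # 30

mean_filled = [mean_val if math.isnan(v) else v for v in values]
median_filled = [median_val if math.isnan(v) else v for v in values]

# Forward fill: replace each NaN with the most recent observed value.
ffilled, last = [], None
for v in values:
    last = last if math.isnan(v) else v
    ffilled.append(last)
```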
20. Explain the difference between feature scaling and feature selection.
21. What is outlier detection? Explain at least two methods to handle outliers.
22. Given the dataset values: [2, 4, 4, 4, 5, 5, 7, 9], calculate the mean, variance, and standard deviation.
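The descriptive statistics can be checked with the standard library; the question does not say whether the population or the sample variance is meant, so both are shown:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 40 / 8 = 5
pvar = statistics.pvariance(data)  # population variance (divide by n): 32 / 8 = 4
pstd = statistics.pstdev(data)     # population std: sqrt(4) = 2.0
svar = statistics.variance(data)   # sample variance (divide by n - 1): 32 / 7
```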
23. Define dimensionality reduction. Differentiate between PCA and t-SNE.
24. A dataset contains two highly correlated features X1 and X2. Explain why multicollinearity is a problem and how PCA helps.
25. A dataset has 1000 rows, out of which 100 rows contain missing values. What percentage of data will be lost if you drop those rows? Discuss its impact.
26. Given a dataset: Height (cm) = [150, 160, 170, 180, 190], normalize using Min-Max scaling to [–1, 1].
27. You are building a fraud detection model. The dataset is highly imbalanced (95% normal, 5% fraud). Suggest at least three preprocessing techniques to handle this imbalance.
28. Derive a binary encoding for the categorical variable: Size = {Small, Medium, Large, Small, Large}.
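A sketch of binary encoding, assuming ids are assigned from 0 in order of first appearance; some implementations start ids at 1 or sort the categories first, which changes the bit patterns:

```python
sizes = ["Small", "Medium", "Large", "Small", "Large"]

# Step 1: label-encode in order of first appearance.
order = list(dict.fromkeys(sizes))           # ['Small', 'Medium', 'Large']
ids = {c: i for i, c in enumerate(order)}    # Small=0, Medium=1, Large=2

# Step 2: write each id in binary, using enough bits for the largest id.
width = max(ids.values()).bit_length()       # 2 bits here
binary = {c: format(i, f"0{width}b") for c, i in ids.items()}

encoded = [binary[s] for s in sizes]         # ['00', '01', '10', '00', '10']
```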
29. A dataset for temperature readings has values: [25, 28, 30, 45, 27, 29, 500]. Identify the outlier and suggest one method to fix it.
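The 1.5×IQR rule is one standard way to flag the outlier; a typical fix is then to drop the flagged value or cap (winsorize) it at the fence:

```python
import statistics

temps = [25, 28, 30, 45, 27, 29, 500]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(temps, n=4)  # default 'exclusive' method
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [t for t in temps if t < lo or t > hi]
capped = [min(max(t, lo), hi) for t in temps]  # winsorized version
```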
30. Explain the main objective of PCA in machine learning.
31. Show mathematically how PCA maximizes variance along the new axes.
32. For a 2D dataset, show step-by-step how to compute PCA manually (covariance matrix, eigenvalues, eigenvectors).
33. A dataset has 10 features. After PCA, only 3 components are kept, which explain 95% of the variance. Explain why dimensionality reduction is beneficial here.
34. Compute the covariance matrix and the first principal component for the dataset: X = {(2, 0), (0, 2), (3, 3), (4, 4)}.
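The computation in NumPy, assuming the sample (n−1) covariance convention; under the population (n) convention every entry of the matrix scales by 3/4 and the principal direction is unchanged:

```python
import numpy as np

X = np.array([(2, 0), (0, 2), (3, 3), (4, 4)], dtype=float)

# Center the data, then form the sample covariance matrix.
Xc = X - X.mean(axis=0)           # mean vector is (2.25, 2.25)
C = np.cov(Xc, rowvar=False)      # [[35/12, 19/12], [19/12, 35/12]] ~ [[2.917, 1.583], ...]

# Eigen-decomposition; the first PC is the eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                  # proportional to (1, 1)/sqrt(2), up to sign
# Top eigenvalue is 4.5.
```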
35. Explain the role of eigenvalues in PCA.
36. What happens if the data is not standardized before applying PCA?
37. What is the difference between an autoencoder and PCA?
38. Draw and explain the architecture of an undercomplete autoencoder.
39. Why do autoencoders use non-linear activation functions while PCA is linear?
40. If the input dimension is 100 and the bottleneck layer is 10, what is the compression ratio?
41. What are denoising autoencoders and their applications?
42. Compare autoencoders with variational autoencoders (VAEs).
43. Explain how autoencoders can be used for anomaly detection.
44. Why does training an autoencoder require regularization (dropout, L1/L2)?
45. Explain overfitting and underfitting with suitable examples.
46. Draw the bias-variance trade-off curve and explain its importance.
47. Suppose a polynomial regression model fits the training data with 99% accuracy but its test accuracy is 65%. Is it overfitting or underfitting? Explain.
48. List four techniques to prevent overfitting in deep learning models.
49. What role does cross-validation play in preventing overfitting?
50. Explain how dropout works to reduce overfitting.
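The mechanism can be sketched as inverted dropout, the variant most frameworks use: survivors are scaled by 1/(1−p) at training time so the layer's expected output is unchanged and no rescaling is needed at test time:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: during training, zero each unit with probability p
    and scale the survivors by 1/(1-p); at test time, pass values through."""
    if not training or p == 0.0:
        return list(activations)
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)  # each kept unit is doubled, the rest are 0
```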
51. Why might increasing training data reduce overfitting?