Assignment 2
1. Explain the role of data management in machine learning. Why is it important
before applying ML algorithms?
2. Differentiate between data acquisition and data preprocessing with suitable
examples.
3. What are structured, semi-structured, and unstructured data? Give two
examples of each.
4. Describe different methods of data acquisition in real-world ML projects.
5. A company collects daily sales records, customer reviews, and website
clickstreams. Classify these into structured, semi-structured, and unstructured
data.
6. Suppose you are building a wind forecasting ML model. List at least five
sources of data acquisition and discuss their challenges.
7. Explain the difference between data governance and data management in ML.
8. What is data representation? Why is it crucial for ML algorithms?
9. Differentiate between sparse and dense data representations with an example.
10. Represent the categorical variable "Color = {Red, Blue, Green}" using one-hot encoding and label encoding.
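Both encodings can be sketched in plain Python (assuming ids are assigned in the listed order Red, Blue, Green; library encoders such as scikit-learn's LabelEncoder sort categories alphabetically, so their ids come out differently):

```python
categories = ["Red", "Blue", "Green"]

# Label encoding: each category gets one integer id.
label_map = {c: i for i, c in enumerate(categories)}  # {'Red': 0, 'Blue': 1, 'Green': 2}

# One-hot encoding: each category becomes a binary indicator vector
# with a single 1 at its label position.
one_hot = {c: [1 if i == label_map[c] else 0 for i in range(len(categories))]
           for c in categories}

print(one_hot["Green"])  # [0, 0, 1]
```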
11. Explain embeddings in ML. How do they differ from one-hot encoding?
12. A dataset contains the following categorical variable: Fruit = {Apple, Banana, Mango, Apple, Mango}. Represent it using frequency encoding.
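The frequency encoding can be checked with a few lines; both raw counts and relative frequencies are common conventions, so both are shown:

```python
from collections import Counter

fruits = ["Apple", "Banana", "Mango", "Apple", "Mango"]
counts = Counter(fruits)  # Apple: 2, Banana: 1, Mango: 2

# Frequency encoding: replace each category by its count (or its share of the data).
freq_encoded = [counts[f] for f in fruits]                 # [2, 1, 2, 2, 2]
rel_freq_encoded = [counts[f] / len(fruits) for f in fruits]  # [0.4, 0.2, 0.4, 0.4, 0.4]
```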
13. What is data preprocessing, and why is it necessary in ML?
14. Differentiate between normalization and standardization. Give their formulas.
17. A dataset has Age values: [15, 20, 25, 40, 50]. Perform Min-Max normalization to scale them to [0, 1].
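A small helper to verify the Min-Max arithmetic (min_max is a hypothetical name introduced here); the same function with new_lo=-1, new_hi=1 covers the later [–1, 1] variant:

```python
def min_max(xs, new_lo=0.0, new_hi=1.0):
    """Min-Max scaling: x' = new_lo + (x - min) * (new_hi - new_lo) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [new_lo + (x - lo) * (new_hi - new_lo) / (hi - lo) for x in xs]

ages = [15, 20, 25, 40, 50]
scaled = min_max(ages)  # endpoints map to 0.0 and 1.0; 20 -> 5/35, 25 -> 10/35, 40 -> 25/35
```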
18. Perform Z-score standardization for the dataset: [10, 20, 30, 40].
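The Z-score arithmetic, assuming the population standard deviation (statistics.stdev, the sample version, gives slightly different z values):

```python
import statistics

data = [10, 20, 30, 40]
mu = statistics.mean(data)       # 25
sigma = statistics.pstdev(data)  # population std = sqrt(125) ~ 11.18

# Z-score: z = (x - mu) / sigma; the result has mean 0 and (population) std 1.
z = [(x - mu) / sigma for x in data]
```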
19. A dataset has missing values: [10, NaN, 30, NaN, 50]. Perform mean, median, and forward fill imputation.
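The three imputation strategies sketched in plain Python (pandas offers the same via fillna and ffill); note that for this data the mean and median of the observed values both equal 30, so those two fills coincide:

```python
import math
import statistics

values = [10, float("nan"), 30, float("nan"), 50]
observed = [v for v in values if not math.isnan(v)]  # [10, 30, 50]

mean_val = statistics.mean(observed)      # 30
median_val = statistics.median(observed)  # 30

mean_filled = [mean_val if math.isnan(v) else v for v in values]
median_filled = [median_val if math.isnan(v) else v for v in values]

# Forward fill: replace each NaN with the most recent observed value.
ffilled, last = [], None
for v in values:
    last = last if math.isnan(v) else v
    ffilled.append(last)
```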
20. Explain the difference between feature scaling and feature selection.
21. What is outlier detection? Explain at least two methods to handle outliers.
22. Given the dataset values: [2, 4, 4, 4, 5, 5, 7, 9], calculate the mean, variance, and standard deviation.
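The descriptive statistics can be checked with the standard library; the question does not say whether the population or the sample variance is meant, so both are shown:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)       # 40 / 8 = 5
pvar = statistics.pvariance(data)  # population variance (divide by n): 32 / 8 = 4
pstd = statistics.pstdev(data)     # population std: sqrt(4) = 2.0
svar = statistics.variance(data)   # sample variance (divide by n - 1): 32 / 7
```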
23. Define dimensionality reduction. Differentiate between PCA and t-SNE.
24. A dataset contains two highly correlated features X1 and X2. Explain why multicollinearity is a problem and how PCA helps.
25. A dataset has 1000 rows, out of which 100 rows contain missing values. What percentage of data will be lost if you drop those rows? Discuss its impact.
26. Given a dataset: Height (cm) = [150, 160, 170, 180, 190], normalize using Min-Max scaling to [–1, 1].
27. You are building a fraud detection model. The dataset is highly imbalanced (95% normal, 5% fraud). Suggest at least three preprocessing techniques to handle this imbalance.
28. Derive a binary encoding for the categorical variable: Size = {Small, Medium, Large, Small, Large}.
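A sketch of binary encoding, assuming ids are assigned from 0 in order of first appearance; some implementations start ids at 1 or sort the categories first, which changes the bit patterns:

```python
sizes = ["Small", "Medium", "Large", "Small", "Large"]

# Step 1: label-encode in order of first appearance.
order = list(dict.fromkeys(sizes))           # ['Small', 'Medium', 'Large']
ids = {c: i for i, c in enumerate(order)}    # Small=0, Medium=1, Large=2

# Step 2: write each id in binary, using enough bits for the largest id.
width = max(ids.values()).bit_length()       # 2 bits here
binary = {c: format(i, f"0{width}b") for c, i in ids.items()}

encoded = [binary[s] for s in sizes]         # ['00', '01', '10', '00', '10']
```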
29. A dataset for temperature readings has values: [25, 28, 30, 45, 27, 29, 500]. Identify the outlier and suggest one method to fix it.
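The 1.5×IQR rule is one standard way to flag the outlier; a typical fix is then to drop the flagged value or cap (winsorize) it at the fence:

```python
import statistics

temps = [25, 28, 30, 45, 27, 29, 500]

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, _, q3 = statistics.quantiles(temps, n=4)  # default 'exclusive' method
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [t for t in temps if t < lo or t > hi]
capped = [min(max(t, lo), hi) for t in temps]  # winsorized version
```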
30. Explain the main objective of PCA in machine learning.
31. Show mathematically how PCA maximizes variance along the new axes.
32. For a 2D dataset, show step-by-step how to compute PCA manually (covariance matrix, eigenvalues, eigenvectors).
33. A dataset has 10 features. After PCA, only 3 components are kept, which explain 95% of the variance. Explain why dimensionality reduction is beneficial here.
34. Compute the covariance matrix and the first principal component for the dataset: X = {(2, 0), (0, 2), (3, 3), (4, 4)}.
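The computation in NumPy, assuming the sample (n−1) covariance convention; under the population (n) convention every entry of the matrix scales by 3/4 and the principal direction is unchanged:

```python
import numpy as np

X = np.array([(2, 0), (0, 2), (3, 3), (4, 4)], dtype=float)

# Center the data, then form the sample covariance matrix.
Xc = X - X.mean(axis=0)           # mean vector is (2.25, 2.25)
C = np.cov(Xc, rowvar=False)      # [[35/12, 19/12], [19/12, 35/12]] ~ [[2.917, 1.583], ...]

# Eigen-decomposition; the first PC is the eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                  # proportional to (1, 1)/sqrt(2), up to sign
# Top eigenvalue is 4.5.
```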
35. Explain the role of eigenvalues in PCA.
36. What happens if the data is not standardized before applying PCA?
37. What is the difference between an autoencoder and PCA?
38. Draw and explain the architecture of an undercomplete autoencoder.
39. Why do autoencoders use non-linear activation functions while PCA is linear?
40. If the input dimension is 100 and the bottleneck layer is 10, what is the compression ratio?
41. What are denoising autoencoders and their applications?
42. Compare autoencoders with variational autoencoders (VAEs).
43. Explain how autoencoders can be used for anomaly detection.
44. Why does training an autoencoder require regularization (dropout, L1/L2)?
45. Explain overfitting and underfitting with suitable examples.
46. Draw the bias-variance trade-off curve and explain its importance.
47. Suppose a polynomial regression model fits the training data with 99% accuracy but its test accuracy is 65%. Is it overfitting or underfitting? Explain.
48. List four techniques to prevent overfitting in deep learning models.
49. What role does cross-validation play in preventing overfitting?
50. Explain how dropout works to reduce overfitting.
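The mechanism can be sketched as inverted dropout, the variant most frameworks use: survivors are scaled by 1/(1−p) at training time so the layer's expected output is unchanged and no rescaling is needed at test time:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: during training, zero each unit with probability p
    and scale the survivors by 1/(1-p); at test time, pass values through."""
    if not training or p == 0.0:
        return list(activations)
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)  # each kept unit is doubled, the rest are 0
```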
51. Why might increasing training data reduce overfitting?