Lecture
Machine Learning 05
Machine Learning Basics
Arslan Ali Khan
[Link]@[Link]
Department of Cyber-Security and Data Science
Riphah Institute of Systems Engineering (RISE),
Riphah International University, Islamabad, Pakistan.
Feature Engineering
• Dealing with Missing Data
Missing values are data points that are absent for a specific variable in a
dataset. They can be represented in various ways, such as blank cells,
null values, or special symbols like “NA” or “unknown.” These missing
data points pose a significant challenge in data analysis and can lead to
inaccurate or biased results.
Feature Engineering
• Dealing with Missing Data
Missing values can pose a significant challenge in data analysis, as they can:
• Reduce the sample size: This can decrease the accuracy and reliability
of your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias
the results of your analysis.
• Make it difficult to perform certain analyses: Some statistical
techniques require complete data for all variables, making them
inapplicable when missing values are present.
Feature Engineering
• Dealing with Missing Data
Using Estimated values:
• Replacing missing values with estimated values.
• Preserves sample size: Doesn’t reduce data points.
• Can introduce bias: Estimated values might not be accurate.
Use of Mean, Median, and Mode:
• Replace missing values with the mean, median, or mode of the relevant variable.
• Simple and efficient: Easy to implement.
• Can be inaccurate: Doesn’t consider the relationships between variables.
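The mean, median, and mode strategies above can be sketched in plain Python; the `ages` column and its missing entries are hypothetical examples, with `None` marking a missing value:

```python
import statistics

# Hypothetical "age" column; None marks a missing value.
ages = [25, 30, None, 22, None, 28, 30]

observed = [v for v in ages if v is not None]

# Replace each missing value with the mean, median, or mode of the observed values.
mean_imputed = [v if v is not None else statistics.mean(observed) for v in ages]
median_imputed = [v if v is not None else statistics.median(observed) for v in ages]
mode_imputed = [v if v is not None else statistics.mode(observed) for v in ages]
```

Note that all three fill every gap with the same constant, which is why relationships between variables are ignored.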
Feature Engineering
• Handling Categorical Data
Categorical data is data that can be divided into groups or categories,
such as gender, hair color, or product type.
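A common way to make such data usable by numeric models is one-hot encoding: each category becomes its own 0/1 column. A minimal sketch, using a hypothetical `colors` column:

```python
# Hypothetical categorical column.
colors = ["red", "green", "blue", "green", "red"]

# One 0/1 column per category, in sorted order: ['blue', 'green', 'red'].
categories = sorted(set(colors))
one_hot = [[1 if value == cat else 0 for cat in categories] for value in colors]

print(one_hot[0])  # 'red' -> [0, 0, 1]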
Feature Engineering
• Normalizing Data
Normalization in machine learning is the process of rescaling data into
the range [0, 1] (or any other fixed range).
• Feature Construction or Generation
Feature Generation (also known as feature construction, feature
extraction or feature engineering) is the process of transforming features
into new features that better relate to the target. This can involve
mapping a feature into a new feature using a function like log, or
creating a new feature from one or multiple features using multiplication
or addition.
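Both kinds of construction can be sketched in a few lines; the housing-style features below are hypothetical illustrations, not data from the slides:

```python
import math

# Hypothetical raw features: floor area (m^2) and price per m^2.
areas = [50.0, 120.0, 80.0]
price_per_m2 = [1000.0, 1200.0, 900.0]

# Mapping a feature through a function: log compresses a skewed range.
log_area = [math.log(a) for a in areas]

# Combining features by multiplication: a total-price feature.
total_price = [a * p for a, p in zip(areas, price_per_m2)]

print(total_price)  # [50000.0, 144000.0, 72000.0]
```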
Feature Scaling
A technique often applied as part of data preparation for machine learning.
Goal: Change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values.
Normalization
Min-max normalization: Guarantees all features will have the exact same scale but does
not handle outliers well.
Z-score standardization: Handles outliers, but does not produce normalized data with the
exact same scale.
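Both techniques can be sketched without any libraries; the `values` list, with a deliberate outlier, is a hypothetical example:

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 100.0]   # 100 is a deliberate outlier

# Min-max normalization: every value lands in [0, 1], but the outlier
# squashes the remaining points into a narrow band.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean and unit standard deviation,
# but no fixed output range.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
z_scores = [(v - mu) / sigma for v in values]
```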
Training, Testing and Validation Sets
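A common convention is to shuffle the data and carve it into the three sets; the 70/15/15 proportions below are an illustrative assumption, not a rule from the slides:

```python
import random

data = list(range(100))      # hypothetical dataset of 100 samples
random.seed(0)
random.shuffle(data)         # shuffle before splitting

# An illustrative 70/15/15 split into train, validation, and test sets.
n = len(data)
train = data[: int(0.70 * n)]
val = data[int(0.70 * n) : int(0.85 * n)]
test = data[int(0.85 * n) :]

print(len(train), len(val), len(test))  # 70 15 15
```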
K-Fold Cross Validation
K-fold cross-validation is a technique for evaluating predictive models.
The dataset is divided into k subsets or folds. The model is trained and
evaluated k times, using a different fold as the validation set each time.
Performance metrics from each fold are averaged to estimate the model's
generalization performance.
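The fold construction described above can be sketched without any libraries; `k_fold_indices` is a hypothetical helper that yields the train/validation index pairs:

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any leftover samples.
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val

# 10 samples, 5 folds: each sample is used for validation exactly once.
folds = list(k_fold_indices(10, 5))
```

In practice the performance metric from each of the k runs would be averaged, as the slide describes.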
Under-fitting and Over-fitting
Overfitting
• Overfitting occurs when the model fits the training data too well and does not
generalize, so it performs badly on the test data.
• It is the result of an excessively complicated model.
Underfitting
• Underfitting occurs when the model does not fit the data well enough.
• It is the result of an excessively simple model.
Under-fitting and Over-fitting
• Both overfitting and underfitting lead to poor predictions on new datasets.
• A learning model that overfits or underfits does not generalize well.
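One way to see both failure modes numerically is to fit polynomials of increasing degree to noisy data; the degrees, noise level, and sine-shaped target below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)  # noisy sine

# Clean test data drawn from the same underlying function.
x_test = np.linspace(0.0, 1.0, 50)
y_test = np.sin(2 * np.pi * x_test)

train_err, test_err = {}, {}
for degree in (1, 3, 9):            # too simple, reasonable, too complex
    coeffs = np.polyfit(x, y, degree)
    train_err[degree] = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    test_err[degree] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
```

Training error always falls as the model grows more complex; it is the test error that exposes under-fitting (degree 1) and over-fitting (degree 9).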
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations.
New data is classified based on the training set.
• Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc.,
the aim is to establish the existence of classes
or clusters in the data.
Machine Learning
• Supervised: We are given input samples (X) and output samples (y)
of a function y = f(X). We would like to “learn” f, and evaluate it on
new data. Types:
Classification: y is discrete (class labels).
Regression: y is continuous, e.g. linear regression.
• Unsupervised: Given only samples X of the data, we compute a
function f such that y = f(X) is “simpler”.
Clustering: y is discrete.
y is continuous: matrix factorization, Kalman filtering, unsupervised neural
networks.
Techniques
• Supervised Learning:
Linear Regression
Logistic Regression
Decision Tree
Naïve Bayes
Random Forests
• Unsupervised Learning:
Clustering
Factor analysis
Topic Models
Regression
Regression Task
Linear Regression Vs Logistic Regression
Linear Regression
Regression Task
Linear Regression
y = mx + c
Linear Regression Example
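The slide's line y = mx + c can be fitted with the closed-form least-squares solution; the sample points below are hypothetical and lie exactly on y = 2x + 1:

```python
# Hypothetical sample points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope m and intercept c from the least-squares normal equations.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
c = mean_y - m * mean_x

print(m, c)  # 2.0 1.0
```

With noisy data the same formulas give the line minimizing the sum of squared vertical errors.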