Lecture 5

The document discusses various aspects of machine learning, focusing on feature engineering, handling missing data, and the differences between supervised and unsupervised learning. It highlights techniques for data normalization, scaling, and the importance of training, testing, and validation sets, including K-fold cross-validation. Additionally, it addresses the issues of overfitting and underfitting in model training.

Special Topics of Machine Learning in Cyber Security
Lecture 05: Machine Learning Basics

Arslan Ali Khan
[email protected]
Department of Cyber-Security and Data Science, Riphah Institute of Systems Engineering (RISE), Riphah International University, Islamabad, Pakistan.
Feature Engineering

• Dealing with Missing Data
Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like "NA" or "unknown." These missing data points pose a significant challenge in data analysis, as they can:
• Reduce the sample size: This can decrease the accuracy and reliability of your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias the results of your analysis.
• Make it difficult to perform certain analyses: Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.
Common strategies include imputation with estimated values (a code sketch follows below):

Using estimated values:
• Replacing missing values with estimated values.
• Preserves sample size: Doesn't reduce data points.
• Can introduce bias: Estimated values might not be accurate.

Use of mean, median, and mode:
• Replace missing values with the mean, median, or mode of the relevant variable.
• Simple and efficient: Easy to implement.
• Can be inaccurate: Doesn't consider the relationships between variables.
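A minimal sketch of mean imputation with pandas and scikit-learn (not from the slides; the age and salary columns are made-up examples):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "salary": [50_000, 62_000, np.nan, 81_000, 58_000],
})

# Replace missing values with the column mean; "median" and
# "most_frequent" (mode) work the same way
imputer = SimpleImputer(strategy="mean")
df[["age", "salary"]] = imputer.fit_transform(df[["age", "salary"]])
print(df)
```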
• Handling Categorical Data
Categorical data is data that can be divided into groups or categories, such as gender, hair color, or product type.
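The slide doesn't prescribe an encoding; one common option is one-hot encoding, sketched here with pandas (the hair_color column is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({"hair_color": ["black", "brown", "black", "red"]})

# One-hot encode the categorical column into binary indicator columns
encoded = pd.get_dummies(df, columns=["hair_color"])
print(encoded)
```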
• Normalizing Data
Normalization in machine learning is the process of translating data into the range [0, 1] (or any other range).
• Feature Construction or Generation
Feature generation (also known as feature construction, feature extraction, or feature engineering) is the process of transforming features into new features that better relate to the target. This can involve mapping a feature into a new feature using a function like log, or creating a new feature from one or multiple features using multiplication or addition.
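A minimal sketch of feature generation along these lines (the income and debt columns are made up; the log and ratio features are illustrative choices):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30_000, 85_000, 120_000],
                   "debt":   [5_000, 20_000, 90_000]})

# Map an existing feature through a function (log compresses large ranges)
df["log_income"] = np.log(df["income"])

# Combine multiple features into a new one (here, a simple ratio)
df["debt_to_income"] = df["debt"] / df["income"]
print(df)
```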
Feature Scaling

A technique often applied as part of data preparation for machine learning.
Goal: Change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.

Normalization

Min-max normalization: Guarantees all features will have the exact same scale but does not handle outliers well.

Z-score standardization: Handles outliers, but does not produce normalized data with the exact same scale.
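A minimal sketch contrasting the two with scikit-learn (the sample values, including the 100 acting as an outlier, are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # 100 acts as an outlier

# Min-max: x' = (x - min) / (max - min), maps everything into [0, 1]
print(MinMaxScaler().fit_transform(X).ravel())

# Z-score: z = (x - mean) / std, centered at 0 but not bounded
print(StandardScaler().fit_transform(X).ravel())
```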
Training, Testing and Validation Sets
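The original slides illustrate these sets with figures; as a minimal sketch, a common recipe applies scikit-learn's train_test_split twice (the 60/20/20 split and the Iris data are illustrative choices, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% as the test set, then carve a validation set out of the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 90 / 30 / 30 samples
```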
K-Fold Cross Validation

K-fold cross-validation is a technique for evaluating predictive models.

• The dataset is divided into k subsets or folds. The model is trained and evaluated k times, using a different fold as the validation set each time.

• Performance metrics from each fold are averaged to estimate the model's generalization performance.
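A minimal sketch with scikit-learn (logistic regression on Iris and k = 5 are illustrative choices, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Train and evaluate 5 times, each time holding out a different fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())  # per-fold accuracy and the averaged estimate
```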
Under-fitting and Over-fitting

Overfitting
• Overfitting occurs when the model fits the training data too well and does not generalize, so it performs badly on the test data.
• It is the result of an excessively complicated model.

Underfitting
• Underfitting occurs when the model does not fit the data well enough.
• It is the result of an excessively simple model.
• Both overfitting and underfitting lead to poor predictions on new datasets.

• A learning model that overfits or underfits does not generalize well.
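A minimal sketch of both failure modes using polynomial regression (synthetic data; degrees 1, 4, and 15 stand in for too simple, roughly right, and too complex):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Noisy samples of a sine wave
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # Overfitting shows up as a high train score with a much lower test score
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```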
Supervised vs. Unsupervised Learning

• Supervised learning (classification)
 Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
 New data is classified based on the training set.
• Unsupervised learning (clustering)
 The class labels of the training data are unknown.
 Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Machine Learning

• Supervised: We are given input samples (X) and output samples (y) of a function y = f(X). We would like to "learn" f and evaluate it on new data. Types:
 Classification: y is discrete (class labels).
 Regression: y is continuous, e.g. linear regression.

• Unsupervised: Given only samples X of the data, we compute a function f such that y = f(X) is "simpler".
 Clustering: y is discrete.
 y is continuous: matrix factorization, Kalman filtering, unsupervised neural networks.
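A minimal sketch contrasting the two settings on the same data (the decision tree and k-means models are illustrative choices, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: learn f from labeled pairs (X, y), then predict on new data
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict(X[:3]))

# Unsupervised: only X is given; discover structure (clusters) in it
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])
```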
Techniques

• Supervised Learning:
 Linear Regression
 Logistic Regression
 Decision Tree
 Naïve Bayes
 Random Forests
• Unsupervised Learning:
 Clustering
 Factor Analysis
 Topic Models
Regression

Regression Task

Linear Regression vs. Logistic Regression

Linear Regression

Y = mx + c

Linear Regression Example
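The example slides are figures; as a hedged sketch of fitting Y = mx + c, with made-up points roughly following y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical points roughly following y = 2x + 1
X = np.array([[0], [1], [2], [3], [4]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Fit Y = mx + c by least squares: coef_ is the slope m, intercept_ is c
model = LinearRegression().fit(X, y)
print("m =", model.coef_[0], "c =", model.intercept_)
print(model.predict([[5]]))  # predict at a new point
```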
