Lecture
Machine Learning 05
Machine Learning Basics
Arslan Ali Khan
[Link]@[Link]
Department of Cyber-Security and Data Science
Riphah Institute of Systems Engineering (RISE),
Riphah International University, Islamabad, Pakistan.
Feature Engineering
• Dealing with Missing Data
Missing values are data points that are absent for a specific variable in a
dataset. They can be represented in various ways, such as blank cells,
null values, or special symbols like “NA” or “unknown.” These missing
data points pose a significant challenge in data analysis and can lead to
inaccurate or biased results.
Feature Engineering
• Dealing with Missing Data
Missing values can pose a significant challenge in data analysis, as they can:
• Reduce the sample size: This can decrease the accuracy and reliability
of your analysis.
• Introduce bias: If the missing data is not handled properly, it can bias
the results of your analysis.
• Make it difficult to perform certain analyses: Some statistical
techniques require complete data for all variables, making them
inapplicable when missing values are present.
Feature Engineering
• Dealing with Missing Data
Using Estimated values:
• Replacing missing values with estimated values.
• Preserves sample size: Doesn’t reduce data points.
• Can introduce bias: Estimated values might not be accurate.
Use of Mean, Median, and Mode:
• Replace missing values with the mean, median, or mode of the relevant variable.
• Simple and efficient: Easy to implement.
• Can be inaccurate: Doesn’t consider the relationships between variables.
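The mean, median, and mode strategies above can be sketched in plain Python; the `ages` column and its missing entries are hypothetical examples, with `None` marking a missing value:

```python
import statistics

# Hypothetical "age" column; None marks a missing value.
ages = [25, 30, None, 22, None, 28, 30]

observed = [v for v in ages if v is not None]

# Replace each missing value with the mean, median, or mode of the observed values.
mean_imputed = [v if v is not None else statistics.mean(observed) for v in ages]
median_imputed = [v if v is not None else statistics.median(observed) for v in ages]
mode_imputed = [v if v is not None else statistics.mode(observed) for v in ages]
```

Note that all three fill every gap with the same constant, which is why relationships between variables are ignored.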
Feature Engineering
• Handling Categorical Data
Categorical data is data that can be divided into groups or categories,
such as gender, hair color, or product type.
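A common way to make such data usable by numeric models is one-hot encoding: each category becomes its own 0/1 column. A minimal sketch, using a hypothetical `colors` column:

```python
# Hypothetical categorical column.
colors = ["red", "green", "blue", "green", "red"]

# One 0/1 column per category, in sorted order: ['blue', 'green', 'red'].
categories = sorted(set(colors))
one_hot = [[1 if value == cat else 0 for cat in categories] for value in colors]

print(one_hot[0])  # 'red' -> [0, 0, 1]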
Feature Engineering
• Normalizing Data
Normalization in machine learning is the process of rescaling data into
the range [0, 1] (or any other fixed range).
• Feature Construction or Generation
Feature Generation (also known as feature construction, feature
extraction or feature engineering) is the process of transforming features
into new features that better relate to the target. This can involve
mapping a feature into a new feature using a function like log, or
creating a new feature from one or multiple features using multiplication
or addition.
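Both kinds of construction can be sketched in a few lines; the housing-style features below are hypothetical illustrations, not data from the slides:

```python
import math

# Hypothetical raw features: floor area (m^2) and price per m^2.
areas = [50.0, 120.0, 80.0]
price_per_m2 = [1000.0, 1200.0, 900.0]

# Mapping a feature through a function: log compresses a skewed range.
log_area = [math.log(a) for a in areas]

# Combining features by multiplication: a total-price feature.
total_price = [a * p for a, p in zip(areas, price_per_m2)]

print(total_price)  # [50000.0, 144000.0, 72000.0]
```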
Feature Scaling
A technique often applied as part of data preparation for machine learning.
Goal: Change the values of numeric columns in the dataset to a common scale, without
distorting differences in the ranges of values.
Normalization
Min-max normalization: Guarantees all features will have the exact same scale but does
not handle outliers well.
Z-score standardization: Handles outliers, but does not produce normalized data with the
exact same scale.
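Both techniques can be sketched without any libraries; the `values` list, with a deliberate outlier, is a hypothetical example:

```python
import statistics

values = [10.0, 20.0, 30.0, 40.0, 100.0]   # 100 is a deliberate outlier

# Min-max normalization: every value lands in [0, 1], but the outlier
# squashes the remaining points into a narrow band.
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score standardization: zero mean and unit standard deviation,
# but no fixed output range.
mu = statistics.mean(values)
sigma = statistics.pstdev(values)
z_scores = [(v - mu) / sigma for v in values]
```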
Training, Testing and Validation Sets
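A common convention is to shuffle the data and carve it into the three sets; the 70/15/15 proportions below are an illustrative assumption, not a rule from the slides:

```python
import random

data = list(range(100))      # hypothetical dataset of 100 samples
random.seed(0)
random.shuffle(data)         # shuffle before splitting

# An illustrative 70/15/15 split into train, validation, and test sets.
n = len(data)
train = data[: int(0.70 * n)]
val = data[int(0.70 * n) : int(0.85 * n)]
test = data[int(0.85 * n) :]

print(len(train), len(val), len(test))  # 70 15 15
```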
K-Fold Cross Validation
K-fold cross-validation is a technique for evaluating predictive models.
The dataset is divided into k subsets or folds. The model is trained and
evaluated k times, using a different fold as the validation set each time.
Performance metrics from each fold are averaged to estimate the model's
generalization performance.
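The fold construction described above can be sketched without any libraries; `k_fold_indices` is a hypothetical helper that yields the train/validation index pairs:

```python
def k_fold_indices(n_samples, k):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any leftover samples.
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        val = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, val

# 10 samples, 5 folds: each sample is used for validation exactly once.
folds = list(k_fold_indices(10, 5))
```

In practice the performance metric from each of the k runs would be averaged, as the slide describes.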
Under-fitting and Over-fitting
Overfitting
• Overfitting occurs when the model fits the training data too well and does not
generalize, so it performs badly on the test data.
• It is the result of an excessively complicated model.
Underfitting
• Underfitting occurs when the model does not fit the data well enough.
• It is the result of an excessively simple model.
Under-fitting and Over-fitting
• Both overfitting and underfitting lead to poor predictions on new datasets.
• A learning model that overfits or underfits does not generalize well.
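One way to see both failure modes numerically is to fit polynomials of increasing degree to noisy data; the degrees, noise level, and sine-shaped target below are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)  # noisy sine

# Clean test data drawn from the same underlying function.
x_test = np.linspace(0.0, 1.0, 50)
y_test = np.sin(2 * np.pi * x_test)

train_err, test_err = {}, {}
for degree in (1, 3, 9):            # too simple, reasonable, too complex
    coeffs = np.polyfit(x, y, degree)
    train_err[degree] = float(np.mean((np.polyval(coeffs, x) - y) ** 2))
    test_err[degree] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
```

Training error always falls as the model grows more complex; it is the test error that exposes under-fitting (degree 1) and over-fitting (degree 9).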
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations.
New data is classified based on the training set.
• Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc.,
the aim is to establish the existence of classes
or clusters in the data.
Machine Learning
• Supervised: We are given input samples (X) and output samples (y)
of a function y = f(X). We would like to “learn” f, and evaluate it on
new data. Types:
Classification: y is discrete (class labels).
Regression: y is continuous, e.g. linear regression.
• Unsupervised: Given only samples X of the data, we compute a
function f such that y = f(X) is “simpler”.
Clustering: y is discrete.
y is continuous: matrix factorization, Kalman filtering, unsupervised neural
networks.
Techniques
• Supervised Learning:
Linear Regression
Logistic Regression
Decision Tree
Naïve Bayes
Random Forests
• Unsupervised Learning:
Clustering
Factor analysis
Topic Models
Regression
Regression Task
Linear Regression Vs Logistic Regression
Linear Regression
Regression Task
Linear Regression
y = mx + c
Linear Regression Example
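The slide's line y = mx + c can be fitted with the closed-form least-squares solution; the sample points below are hypothetical and lie exactly on y = 2x + 1:

```python
# Hypothetical sample points lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope m and intercept c from the least-squares normal equations.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
c = mean_y - m * mean_x

print(m, c)  # 2.0 1.0
```

With noisy data the same formulas give the line minimizing the sum of squared vertical errors.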