4_Outliers_+Transformaations ML

The document provides a comprehensive overview of outliers in data, defining them as significant deviations from the majority of data points and categorizing them into global, collective, and contextual types. It discusses various detection methods, including statistical, visual, distance-based, and model-based techniques, as well as the importance of addressing outliers to prevent biased models and reduced accuracy. Additionally, it outlines handling methods such as transformations, trimming, capping, imputation, and model-based approaches to mitigate the impact of outliers on data analysis.

Uploaded by

hetvibhora192

Outliers + Data Transformations
Outliers
Definition: data points that deviate significantly from the majority of the data, being either much larger or much smaller
Types: 1) global 2) collective 3) contextual
Causes: 1) measurement 2) data processing 3) sampling 4) natural variation

Detection methods:
1) Statistical - 1. Z-score (use only when the data is not skewed, i.e. close to a normal distribution; otherwise use IQR), 2. IQR
2) Visual - 1. box plot, 2. scatter plot
3) Distance based - 1. KNN, 2. LOF
4) Model based - 1. DBSCAN, 2. hierarchical clustering

Importance: 1. biased models
2. reduced accuracy
3. increased variance
4. reduced interpretability
--- outliers increase bias and variance and decrease accuracy and interpretability

Handling Methods to eliminate outliers


Transforming - 1. log transformation 2. Box-Cox transformation
Trimming & capping
Imputation - 1. replacing outliers
Model based - 1. robust regression 2. Isolation Forest

Transformation:
Function transformer: a) log b) square root c) reciprocal
Power transformer: a) Box-Cox b) Yeo-Johnson

Binning & Binarization

MICE
What is an Outlier?
An outlier is a data point that significantly deviates from
the rest of the data. It can be either much higher or much
lower than the other data points, and its presence can have
a significant impact on the results of machine learning
algorithms. They can be caused by measurement or
execution errors. The analysis of outlier data is referred to
as outlier analysis or outlier mining.

Reference: https://www.geeksforgeeks.org/machine-learning-outlier/
IMP
Outliers can badly affect certain algorithms, such as linear regression,
logistic regression, and AdaBoost.
What these algorithms have in common is that they compute weights. So a
simple way to judge whether outliers will affect your model is to ask,
"Am I working with a weight-based algorithm?" If the answer is yes, then
outliers can impact your model.
Tree-based algorithms, such as decision trees, random forests, gradient
boosting, and XGBoost, are much less affected by outliers.
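
To illustrate why weight-based models are sensitive, here is a minimal sketch that fits a straight line by ordinary least squares, once on clean data and once with a single corrupted value. The data and the helper name `fit_line` are made up for illustration; they are not from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5, 6]
ys_clean = [2, 4, 6, 8, 10, 12]     # illustrative data, exactly y = 2x
ys_outlier = [2, 4, 6, 8, 10, 60]   # same data with the last value corrupted

slope_clean, _ = fit_line(xs, ys_clean)
slope_outlier, _ = fit_line(xs, ys_outlier)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_outlier:.2f}")
```

A single extreme point is enough to pull the learned slope far away from the trend followed by the other five points.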
Types of Outliers

1. Global Outliers
Global outliers are isolated data points that are far away from the
main body of the data. They are often easy to identify and remove.

2. Collective Outliers
Collective outliers are groups of data points that collectively deviate
significantly from the overall distribution of a dataset.

3. Contextual Outliers
Contextual outliers are data points that deviate significantly from the
expected behavior within a specific context or subgroup.

Reference: https://www.geeksforgeeks.org/types-of-outliers-in-data-mining/
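
A contextual outlier can be sketched by running a check inside each context instead of on the whole dataset. The example below is only an illustration: the temperature readings, the season labels, and the function name `contextual_outliers` are hypothetical, and the IQR rule is one possible choice of per-group check.

```python
import statistics
from collections import defaultdict


def contextual_outliers(records, k=1.5):
    """Flag (context, value) records that fall outside the IQR fences
    computed from the values sharing the same context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        q1, _, q3 = statistics.quantiles(by_context[context], n=4, method="inclusive")
        iqr = q3 - q1
        if value < q1 - k * iqr or value > q3 + k * iqr:
            flagged.append((context, value))
    return flagged


# Hypothetical temperature readings in degrees Celsius.
readings = [
    ("winter", 2), ("winter", 4), ("winter", 3),
    ("winter", 1), ("winter", 5), ("winter", 30),
    ("summer", 29), ("summer", 30), ("summer", 31),
    ("summer", 32), ("summer", 30), ("summer", 31),
]

print(contextual_outliers(readings))
```

A reading of 30 is ordinary in summer but stands out in winter, so only the winter record is flagged.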
How Outliers are Formed?
Potential Causes

Measurement Errors:
• Human errors or faulty instruments can result in incorrect data points.
• Example: Typing errors in data entry or a malfunctioning sensor.

Data Processing Errors:
• Errors introduced during data cleaning, transformation, or merging.
• Example: Incorrect scaling that causes a value to be disproportionately high.

Sampling Errors:
• Non-representative samples may include extreme values that do not reflect the population.
• Example: Surveying only a specific subgroup, leading to biased data.

Natural Variation:
• Some outliers naturally occur due to rare events or extreme conditions.
• Example: Exceptional weather conditions in climate data.

Reference: https://careerfoundry.com/en/blog/data-analytics/what-is-an-outlier/#:~:text=2.%20how%20do%20outliers%20end%20up%20in%20datasets%3F
Methods to Detect Outliers
Statistical Methods:

● Z-score:
This method measures how many standard deviations each data point lies
from the mean and identifies outliers as points whose absolute Z-score
exceeds a certain threshold (typically 3).
Use it only when the data is not skewed, i.e. close to a normal
distribution; otherwise use IQR.

● IQR (Interquartile Range):
IQR identifies outliers as data points falling outside the range defined by
Q1-k*(Q3-Q1) and Q3+k*(Q3-Q1), where Q1 and Q3 are the first and
third quartiles, and k is a factor (typically 1.5).

Reference: https://www.geeksforgeeks.org/machine-learning-outlier/......
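
Both statistical rules can be sketched with Python's standard `statistics` module. The sample data and the function names are illustrative assumptions; the thresholds (3 for the Z-score, k = 1.5 for IQR) are the typical values quoted on the slide, and the quartile convention (`method="inclusive"`) is one of several in common use.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs((v - mean) / std) > threshold]


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Illustrative data: fifteen typical readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 10, 11, 13, 12, 95]

print("Z-score outliers:", zscore_outliers(data))
print("IQR outliers:    ", iqr_outliers(data))
```

Note that the extreme value inflates the mean and standard deviation that the Z-score itself depends on, which is one reason the IQR rule is often preferred for skewed or small samples.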
Methods to Detect Outliers
Visual Methods:

Box Plot
Highlights outliers as points outside the whiskers, which extend up to
1.5 times the interquartile range (IQR) beyond the quartiles.

Scatter Plot
Useful for visualizing outliers in bivariate data by plotting individual points.
Methods to Detect Outliers
Distance-Based Methods

● K-Nearest Neighbors (KNN) & Local Outlier Factor (LOF):
● KNN identifies outliers as data points whose K nearest neighbors are
far away from them.
● LOF calculates the local density of data points and identifies
outliers as those with significantly lower density compared to their
neighbors.

Reference: https://www.geeksforgeeks.org/machine-learning-outlier/.........
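
The KNN idea can be sketched as a score: the mean distance from each point to its k nearest neighbors, where a large score suggests an outlier. This is a minimal illustration with made-up points and a hypothetical function name `knn_scores`; LOF, which additionally compares each point's density with that of its neighbors, is not implemented here.

```python
import math


def knn_scores(points, k=2):
    """Mean Euclidean distance from each point to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Illustrative 2-D data: four points close together and one far away.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]

scores = knn_scores(points, k=2)
most_isolated = points[scores.index(max(scores))]

for point, score in zip(points, scores):
    print(point, round(score, 3))
print("most isolated point:", most_isolated)
```

In practice a cutoff on the score (or a fixed number of top-scoring points) would be needed to decide which points count as outliers; that choice is left open here.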
Methods to Detect Outliers
Clustering-Based Methods

● Density-Based Spatial Clustering of Applications with Noise (DBSCAN):
It clusters data points based on their density and identifies outliers
as points not belonging to any cluster.

● Hierarchical clustering:
It involves building a hierarchy of clusters by iteratively merging or
splitting clusters based on their similarity. Outliers can be identified
as clusters containing only a single data point or clusters significantly
smaller than others.

Reference: https://www.geeksforgeeks.org/machine-learning-outlier/...........
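
The way DBSCAN labels noise can be sketched without building the clusters themselves: a point is noise if it is not a core point and is not within `eps` of any core point. The code below is a simplified illustration under that definition, not a full DBSCAN implementation; the points, `eps`, `min_samples`, and the function name `dbscan_noise` are assumptions chosen for the example.

```python
import math


def dbscan_noise(points, eps, min_samples):
    """Return the points that DBSCAN would label as noise.

    A core point has at least min_samples points (itself included) within eps.
    A noise point is neither a core point nor within eps of a core point.
    """
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_samples}

    noise = []
    for i, nb in enumerate(neighborhoods):
        if i not in core and not any(j in core for j in nb):
            noise.append(points[i])
    return noise


# Illustrative data: two dense groups and one isolated point.
points = [
    (1.0, 1.0), (1.2, 1.1), (0.9, 1.2), (1.1, 0.8),
    (5.0, 5.0), (5.1, 5.2), (4.9, 5.1), (5.2, 4.9),
    (9.0, 1.0),
]

print(dbscan_noise(points, eps=0.5, min_samples=3))
```

The result depends strongly on `eps` and `min_samples`; with a much smaller `eps`, points inside the dense groups would start to be reported as noise too.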
Importance of outlier detection in machine learning

1. Biased models:
Outliers can bias estimates of parameters like mean, variance, and
coefficients, resulting in misleading inferences.

2. Reduced accuracy:
Outliers can introduce noise into the data, making it difficult for a
machine learning model to learn the true underlying patterns.

3. Increased variance:
Outliers can increase the variance of a machine learning model, making
it more sensitive to small changes in the data.

4. Reduced interpretability:
Outliers can make it difficult to understand what a machine learning
model has learned from the data.

Reference: https://www.geeksforgeeks.org/machine-learning-outlier/................
Handling Methods to Eliminate Outliers
Transforming Data:

Log Transformation
Compresses the range of data, reducing the impact of large outliers.

Box-Cox Transformation
Stabilizes variance and normalizes the data.

Reference: https://github.com/dhirajnair04/Outlier_handling
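
Both transformations can be sketched in a few lines of plain Python. The income figures are made up, and the Box-Cox parameter `lam` is simply passed in by hand here; in practice it is usually estimated from the data, which this sketch does not attempt.

```python
import math


def log_transform(values):
    """Apply log(1 + x), which is defined for x >= 0 and compresses large values."""
    return [math.log1p(v) for v in values]


def box_cox(values, lam):
    """Box-Cox transform for strictly positive values with a given lambda.

    (x**lam - 1) / lam when lam != 0, and log(x) when lam == 0.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


# Illustrative data with one very large value.
incomes = [20, 25, 30, 35, 40, 1000]

logged = log_transform(incomes)
print("raw max/min ratio:", max(incomes) / min(incomes))
print("log max/min ratio:", round(max(logged) / min(logged), 2))
print("Box-Cox, lam=0.5: ", [round(v, 2) for v in box_cox(incomes, 0.5)])
```

After the log transform the extreme value is still the largest, but it is no longer dozens of times larger than the rest.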
Handling Methods to Eliminate Outliers
Trimming/Capping:

Trimming
Remove outliers by excluding data beyond a certain percentile.

Capping
Limit extreme values by setting a maximum threshold to reduce the impact of outliers.

Reference: https://pub.towardsai.net/outlier-detection-and-treatment.........
https://github.com/dhirajnair04/Outlier_handling
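
A minimal sketch of both operations, assuming the 5th and 95th percentiles as cutoffs (an illustrative choice, not a value from the slides). Here the cutoffs are applied on both tails, so capping also sets a minimum as well as a maximum.

```python
import statistics


def percentile_bounds(values, lower_pct=5, upper_pct=95):
    """Return the values at the given lower and upper percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def trim(values, low, high):
    """Trimming: drop every value outside [low, high]."""
    return [v for v in values if low <= v <= high]


def cap(values, low, high):
    """Capping: pull every value outside [low, high] back to the nearest bound."""
    return [min(max(v, low), high) for v in values]


# Illustrative data: the integers 1..19 plus one extreme value.
data = list(range(1, 20)) + [500]

low, high = percentile_bounds(data)
trimmed = trim(data, low, high)
capped = cap(data, low, high)

print("bounds: ", low, high)
print("trimmed:", len(trimmed), "values, max", max(trimmed))
print("capped: ", len(capped), "values, max", max(capped))
```

Trimming shrinks the dataset, while capping keeps every row and only changes the extreme values, which matters when the other columns of those rows are still useful.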
Handling Methods to Eliminate Outliers
Imputation:

Replacing Outliers
Replace outliers with more typical values like the mean, median, or mode.

Reference: https://www.linkedin.com/pulse/when-ho...
https://github.com/dhirajnair04/Outlier_handling
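
As a sketch of imputation, the function below flags outliers with the IQR rule and replaces them with the median of the remaining values. The choice of the IQR rule for flagging, the median as the replacement, the sample ages, and the function name are all assumptions made for this example.

```python
import statistics


def impute_outliers_with_median(values, k=1.5):
    """Replace values outside the IQR fences with the median of the inliers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    inliers = [v for v in values if lower <= v <= upper]
    replacement = statistics.median(inliers)
    return [v if lower <= v <= upper else replacement for v in values]


# Illustrative data: ages with one obvious data-entry error.
ages = [22, 25, 27, 24, 26, 23, 25, 240]

print(impute_outliers_with_median(ages))
```

The median is computed from the inliers only, so the replacement value is not itself distorted by the outlier it replaces.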
Handling Methods to Eliminate Outliers
Model-Based Methods:

Robust Regression
Less sensitive to outliers, ensuring that they don’t disproportionately
affect the model.

Isolation Forest
Detects outliers by isolating observations that are different from the
majority.

Reference: https://github.com/dhirajnair04/Outlier_handling
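
One simple robust regression technique is the Theil-Sen estimator, which takes the median of the slopes between all pairs of points. It is used here only as an example of a robust method; the slides do not name a specific one, and the data and function name are illustrative. Isolation Forest is not sketched here.

```python
import statistics
from itertools import combinations


def theil_sen(xs, ys):
    """Theil-Sen line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = statistics.median(slopes)
    intercept = statistics.median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


# Illustrative data: y = 2x except for one corrupted final value.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 60]

slope, intercept = theil_sen(xs, ys)
print(f"robust fit: y = {slope:.2f} * x + {intercept:.2f}")
```

Because medians ignore a minority of extreme slopes, the fit follows the trend of the five consistent points rather than being dragged toward the corrupted one.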
Thank
You!
