0% found this document useful (0 votes)
5 views

Outliers ML

The document discusses outliers in data, defining them as data points that significantly deviate from the rest and can impact machine learning results. It categorizes outliers into global, collective, and contextual types, and outlines potential causes such as measurement errors and natural variation. Additionally, it presents various methods for detecting and handling outliers, emphasizing their importance in maintaining model accuracy and interpretability.

Uploaded by

Hetvi Bhora
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views

Outliers ML

The document discusses outliers in data, defining them as data points that significantly deviate from the rest and can impact machine learning results. It categorizes outliers into global, collective, and contextual types, and outlines potential causes such as measurement errors and natural variation. Additionally, it presents various methods for detecting and handling outliers, emphasizing their importance in maintaining model accuracy and interpretability.

Uploaded by

Hetvi Bhora
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 14

Outliers

in Data
Identification, Impact and Management

By: Dhiraj Nair –


D05
Jeet
Nakrani – D10
What is an Outlier?
An outlier is a data point that significantly deviates from
the rest of the data. It can be either much higher or much
lower than the other data points, and its presence can have
a significant impact on the results of machine learning
algorithms. They can be caused by measurement or
execution errors. The analysis of outlier data is referred to
as outlier analysis or outlier mining.

Referen
ce
https://www.geeksforgeeks.org/machine-learning-outlier/
Types of Outliers
0 Global Outliers 0 Collective Outliers 0 Contextual Outliers

1
Global outliers are Collective outliers are Contextual outliers are
isolated data points that
are far away from the
2 groups of data points
that collectively deviate
3 data points that deviate
significantly from the
main body of the data. significantly from the expected behavior within
They are often easy to overall distribution of a a specific context or
identify and remove. dataset. subgroup.

Referen https://www.geeksforgeeks.org/types-of-outliers-in-
data-mining/
How Outliers are
Potential Causes Formed?
Measurement Errors: Data Processing Errors:

• Human errors or faulty instruments • Errors introduced during data


can result in incorrect data points. cleaning, transformation, or
merging.
• Example: Typing errors in data
entry or a malfunctioning sensor. • Example: Incorrect scaling that
causes a value to be
disproportionately high.
Sampling Errors:
Natural Variation:
• Non-representative samples may
• Some outliers naturally occur due
include extreme values that do not
to rare events or extreme
reflect the population.
conditions.
• Example: Surveying only a specific
• Example: Exceptional weather
subgroup leading to biased data..
conditions in climate data
Referen https://careerfoundry.com/en/blog/data-analytics/what-is-an-outlier/#:~:text=2.%20how%20do%20outliers%20end%2
0up%20in%20datasets%3F
Methods to Detect Outliers
Statistical Referen https://www.geeksforgeeks.org/machine-learnin
g-outlier/......
ce:
Methods:
● Z-score: ● IQR (Interquartile
This method calculates the standard Range):
IQR identifies outliers as data points
deviation of the data points and falling outside the range defined by
identifies outliers as those with Z- Q1-k*(Q3-Q1) and Q3+k*(Q3-Q1),
scores exceeding a certain where Q1 and Q3 are the first and
threshold (typically 3 or -3). third quartiles, and k is a factor
(typically 1.5).
Methods to Detect Outliers
Visual
Methods:

Box Plot Scatter Plot


Highlights outliers as points outside the whiskers, Useful for visualizing outliers in bivariate
which represent 1.5 times the interquartile range data by plotting individual points.
(IQR)
Methods to Detect Outliers
Distance-Based
Methods

● K-Nearest Neighbors
(KNN) & Local
Outlier Factor (LOF):
● KNN identifies outliers as data points
whose K nearest neighbors are far away
from them.
● This method calculates the local density of
data points and identifies outliers as those
with significantly lower density compared
to their neighbors.

Referen https://www.geeksforgeeks.org/machine-learning
-outlier/.........
ce:
Methods to Detect Outliers
Clustering-Based Referen https://www.geeksforgeeks.org/machine-learning
-outlier/...........
Methods ce:
● Density-Based Spatial ● Hierarchical
Clustering of clustering:
Applications with Noise
It involves building a hierarchy of
(DBSCAN):
It clusters data points based on clusters by iteratively merging or
their density and identifies outliers splitting clusters based on their
as points not belonging to any similarity. Outliers can be identified
cluster. as clusters containing only a single
data point or clusters significantly
smaller than others.
Importance of outlier detection
in machine learning

0 Biased models:
0 Reduced accuracy:
1 Outliers can bias estimates of 2 Outliers can introduce noise into the
parameters like mean, variance, data, making it difficult for a
and coefficients, resulting in machine learning model to learn the
misleading inferences. true underlying patterns.

0 Increased 0 Reduced
3 variance:
Outliers can increase the variance of a 4 interpretability:
Outliers can make it difficult to
machine learning model, making it more understand what a machine
sensitive to small changes in the data. learning model has learned from the
data.

Referen https://www.geeksforgeeks.org/machine-learning-outlier/................
Handling Methods to Eliminate
Outliers
Transforming Data:

Log Box-Cox
Compresses theTransformatio
range of data, reducing the impact of StabilizesTransformation
variance and normalizes the data.
large outliers.
n Referen https://github.com/dhirajnair04/Outlier_handling
Handling Methods to Eliminate
Outliers
Trimming/Capping:

Trimming Capping
Remove outliers by excluding data beyond a certain Limit extreme values by setting a maximum threshold
percentile. to reduce the impact of outliers.
Referen https://pub.towardsai.net/outlier-detection-and-treatm
ent.........
https://github.com/dhirajnair04/Outlier_handling
Handling Methods to Eliminate
Outliers
Imputation:

Replacing Outliers
Replace outliers with more typical values like the
Referen https://www.linkedin.com/pulse/when-ho...
............ mean,
median, or mode
https://github.com/dhirajnair04/Outlier_handling
ce:
Handling Methods to Eliminate
Outliers
Model-Based Methods:

Robust Isolation Forest


Less sensitive toRegression
outliers, ensuring that they don’t Detects outliers by isolating observations that are
disproportionately affect the mode different from the majority
Referen https://github.com/dhirajnair04/Outlier_handling
Thank
You!

You might also like