Concepts of EDA, Outliers-Detection and Treatment

Uploaded by

shailaja.m


UNIT-2

MLDS
Contents:
• Exploratory Data Analysis (EDA):
• The process of EDA,
• knowing Initial Details about Data,
• Modifying or Removing Unwanted Data,
• Retrieving Data,
• Getting Statistical Data,
• Drawing Graphs/ Plots.
• Outliers:
• Causes of Outliers,
• Detecting the Outliers,
• Sorting the Data,
• Drawing Graphs/ Plots,
• Inter Quartile Range (IQR) Method,
• How to Handle Outliers
Exploratory Data Analysis (EDA)
• It is a crucial step in the data analysis process.
• It involves summarizing the main characteristics of the data, often
with visual methods.
• EDA is an iterative process that involves:
1.Understanding the initial structure and details of the data.
2.Cleaning and preprocessing the data.
3.Extracting and focusing on relevant subsets of data.
4.Computing statistical summaries.
5.Visualizing the data for better insights.
1. Knowing Initial Details about Data:
• Understand the basic structure and attributes of the dataset.
Steps:
•Load the dataset.
•View the first few rows.
•Check the data types and non-null values.
•Get a concise summary of the dataset.
import pandas as pd

# Load dataset
df = pd.read_csv('your_dataset.csv')

# View the first few rows
print(df.head())

# Check data types and non-null values (info() prints its report directly)
df.info()

# Get a concise summary
print(df.describe())
2. Modifying or Removing Unwanted Data:
• Clean the data by handling missing values, outliers, and irrelevant features.
Steps:
•Identify and handle missing values.
•Remove duplicate rows.
•Drop irrelevant columns.
# Handling missing values
df = df.dropna()  # Drop rows with missing values
# (or)
df = df.ffill()  # Forward fill to handle missing values

# Remove duplicate rows
df = df.drop_duplicates()

# Drop irrelevant columns
df = df.drop(['column_name1', 'column_name2'], axis=1)
3. Retrieving Data:
• Extract specific subsets of data for focused analysis.
Steps:
•Filter rows based on conditions.
•Select specific columns.
•Group data by categorical variables.
# Filter rows based on conditions (value is a placeholder for a threshold you choose)
filtered_df = df[df['column_name'] > value]

# Select specific columns
selected_columns = df[['column1', 'column2']]

# Group data by categorical variables (average the numeric columns only)
grouped_df = df.groupby('category_column').mean(numeric_only=True)
4. Getting Statistical Data:
• Calculate summary statistics to understand the data distribution.
Steps:
•Compute basic statistics (mean, median, mode, etc.).
•Calculate variance and standard deviation.
•Identify the correlation between variables.
# Basic statistics
mean_value = df['column_name'].mean()
median_value = df['column_name'].median()
mode_value = df['column_name'].mode()  # Returns a Series, since there can be several modes

# Variance and standard deviation
variance = df['column_name'].var()
std_deviation = df['column_name'].std()

# Correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
5. Drawing Graphs/Plots
Objective: Visualize data to identify patterns, trends, and
outliers.
Steps:
•Create histograms for distribution.
•Use box plots for identifying outliers.
•Plot scatter plots for relationship between variables.
•Draw line plots for time series data.
•Generate heatmaps for correlation matrices.
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
df['column_name'].hist()
plt.show()

# Box plot
df.boxplot(column='column_name')
plt.show()

# Scatter plot
plt.scatter(df['column1'], df['column2'])
plt.xlabel('column1')
plt.ylabel('column2')
plt.show()

# Line plot (time series)
df['column_name'].plot()
plt.show()

# Heatmap for correlation matrix
sns.heatmap(correlation_matrix, annot=True)
plt.show()
Outliers
• An outlier is a data point that differs significantly from the other data points in a dataset.
• Outliers can significantly affect machine learning models and the interpretation of the data, so it is essential to detect them.
Causes of Outliers
1. Measurement Errors: Mistakes during data collection or recording
can introduce outliers.
Example:
• Suppose a researcher is measuring the height of students in a school.
• If the measuring tape is not used correctly, say it’s not held straight, it
could result in a height reading that is significantly higher or lower than
the actual height.
• This incorrect measurement can introduce outliers in the dataset.
2. Data Entry Errors: Typographical errors or incorrect data entries can
create outliers.
Example:
• When entering data manually into a computer system, a typographical
error might occur.
• For instance, if a person's age is recorded as 250 instead of 25 due to a
typing error, this will appear as an outlier in the age data of the
population.
3. Experimental Errors: Issues with experimental procedures or equipment can result in outliers.
Example:
• In a chemistry experiment, if the balance used to weigh chemicals is not
calibrated properly, it could give incorrect readings.
• For example, if 10 grams of a substance is recorded as 12 grams due to a faulty balance, this erroneous reading will be an outlier when compared to other correct measurements.
4. Natural Variability: Some outliers naturally occur due to the inherent
variability in data.

Example:
• In biological data, there might be a natural occurrence of outliers due to
genetic mutations or other factors.
• For instance, in a dataset measuring human heights, someone with
gigantism (a condition causing excessive growth) might be an outlier
due to their unusually tall stature compared to the average population.
5. Sampling Errors: When a sample does not represent the population
well, outliers can emerge.
Example:
• If a survey on household incomes is conducted in a wealthy
neighborhood only, the results will not represent the general
population accurately.
• This skewed sampling can result in outliers when the data is compared
with a more representative sample of the population.
Detecting Outliers
• Detecting outliers is crucial because they can distort the overall
picture of the data and lead to incorrect conclusions if not
appropriately handled.
• Outliers can also affect the performance of many machine learning
models, as they can skew the results and lead to overfitting or poor
generalization.
• Thus, detecting outliers is essential for cleaning and preparing the
data for analysis and ensuring the results’ validity.
Detecting Outliers-Visual Inspection
• Plotting the data, such as using scatter plots or histograms, can help you
visually identify outliers that stand out from the rest of the data.

• Box Plots: A box plot displays data distribution and highlights outliers outside the "whiskers."

• Scatter Plots: These plots show individual data points and help identify
outliers in the context of two variables.
Box Plots
• A box plot (also known as a box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary.
• Five-Number Summary:
• Minimum: The smallest data point excluding outliers.
• First Quartile (Q1): The median of the lower half of the dataset.
• Median: The middle value of the dataset.
• Third Quartile (Q3): The median of the upper half of the dataset.
• Maximum: The largest data point excluding outliers.
• Box Plot Structure:
• Box: The box represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). It contains the middle 50% of the data.
• Whiskers: Lines extending from the box to the smallest and largest data points within 1.5 times the IQR from the first and third quartiles.
• Outliers: Data points outside the whiskers are considered outliers and are plotted as individual points.
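• Locating the first quartile: in the sorted data set 1, 2, 3, 4, 5, 11, 11, 12, 14, 20 (n = 10), the position of Q1 is ¼(n + 1) = ¼(10 + 1) = 11/4 = 2.75.
• This means that the first quartile is located at position 2.75 in the data set, i.e., between the 2nd and 3rd numbers (2 and 3). Moving three quarters of the way from the 2nd to the 3rd number gives Q1 = 2 + 0.75 × (3 − 2) = 2.75.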
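The position rule above can be sketched in a few lines of plain Python. This is only an illustration: the helper name quartile_by_position is hypothetical, and it assumes the ¼(n + 1) and ¾(n + 1) position convention with linear interpolation between neighbouring values. Libraries often use other interpolation conventions, so their quartiles may differ slightly.

```python
def quartile_by_position(sorted_values, fraction):
    """Value at 1-based position fraction * (n + 1), interpolating linearly."""
    n = len(sorted_values)
    position = fraction * (n + 1)
    position = min(max(position, 1), n)   # keep the position inside the data
    lower_index = int(position)           # whole part of the position
    weight = position - lower_index       # fractional part of the position
    lower = sorted_values[lower_index - 1]
    upper = sorted_values[min(lower_index, n - 1)]
    return lower + weight * (upper - lower)


data = sorted([1, 2, 3, 4, 5, 11, 11, 12, 14, 20])
q1 = quartile_by_position(data, 0.25)   # position 2.75
q3 = quartile_by_position(data, 0.75)   # position 8.25
iqr = q3 - q1
print(q1, q3, iqr)
```

For this data set the sketch reproduces Q1 = 2.75 from the slide, and the same rule places Q3 a quarter of the way between the 8th and 9th numbers.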
Example:
• If you have a dataset of exam scores, a box plot can quickly show you
the overall distribution of scores, the median score, and any unusually
high or low scores (outliers).
Scatter Plots
• A scatter plot is a type of data visualization that displays individual data
points on a two-dimensional graph.
• Scatter plots often have a pattern.
• We call a data point an outlier if it doesn't fit the pattern.
• It helps in identifying relationships between two variables and spotting
outliers.
• Data Points: Each point on the scatter plot represents an individual
observation with its position determined by the values of the two
variables.
Example:
• Imagine you are analyzing the relationship between study hours and exam scores.
• A scatter plot will show you how exam scores change with different study hours.
• Points that are far away from the general cluster of data might indicate outliers
(e.g., a student who studied very little but scored very high or vice versa).
• Box Plots are useful for summarizing data distributions and spotting outliers in a single variable.
• Scatter Plots help visualize relationships between two variables and identify outliers within the context of this relationship.
Statistical Methods for Detecting Outliers
Z-Scores
• A Z-score is a statistical measure that describes how far a data point is
from the mean of a dataset, in terms of standard deviations.
• It helps in identifying outliers by showing how unusual a data point is
compared to the rest of the data.
Interpreting Z-Scores:
• A Z-score of 0 means the data point is exactly at the mean.
• Positive Z-scores indicate data points above the mean.
• Negative Z-scores indicate data points below the mean.
• Typically, a Z-score above 3 or below -3 is considered an outlier.
• This means the data point is more than 3 standard deviations away from the mean, which is quite unusual.
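As a minimal sketch of the rule above, the Z-score of a value x is z = (x − mean) / standard deviation. The numbers below are made up for illustration, and the population standard deviation (NumPy's default) is an assumption; using the sample standard deviation changes the scores slightly.

```python
import numpy as np

# Illustrative values: nineteen readings near 50 and one extreme reading
values = np.array([48, 50, 52, 49, 51, 50, 47, 53, 50, 49,
                   51, 52, 48, 50, 49, 51, 50, 52, 48, 200])

mean = values.mean()
std = values.std()                 # population standard deviation (ddof=0)
z_scores = (values - mean) / std   # distance from the mean in standard deviations

# Flag points more than 3 standard deviations away from the mean
outliers = values[np.abs(z_scores) > 3]
print(outliers)   # → [200]
```

Note that with only a handful of data points no value can reach |z| > 3, because a single extreme value also inflates the standard deviation. The threshold of 3 is therefore better suited to larger samples.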
Inter Quartile Range (IQR) Method
• The Inter Quartile Range (IQR) method is a robust statistical approach for
detecting outliers based on the spread of the middle 50% of the data.
1. Calculation of IQR:
• First Quartile (Q1): The median of the lower half of the dataset (25th
percentile).
• Third Quartile (Q3): The median of the upper half of the dataset (75th
percentile).
• IQR: The range between Q1 and Q3:
IQR=Q3−Q1
2. Identifying Outliers:
• Data points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR are considered outliers.
• This rule is based on the idea that most of the data should lie within 1.5 times the IQR from the quartiles.
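The two steps above can be sketched with NumPy as follows. The list of ages is illustrative, and np.percentile is used with its default linear interpolation, which is one of several quartile conventions and an assumption of this sketch.

```python
import numpy as np

# Illustrative ages with one suspicious value
ages = np.array([5, 6, 6, 7, 8, 9, 10, 100])

# Step 1: quartiles and IQR
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1

# Step 2: fences at 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(outliers)   # → [100]
```

The multiplier 1.5 is the conventional choice. A larger multiplier flags fewer points and a smaller one flags more.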
• Z-Scores
• provide a way to determine how far a data point is from the mean, in terms of
standard deviations.
• Typically, Z-scores above 3 or below -3 indicate outliers.
• IQR Method
• uses the spread of the middle 50% of the data to define a range.
• Data points outside 1.5 times the IQR from the first and third quartiles are
considered outliers.
• Both methods are effective for identifying outliers and are widely used
in data analysis to ensure accurate results and insights.
Outlier Detection in Machine Learning
• There are various outlier detection techniques in machine learning that are categorized as supervised methods, semi-supervised methods, and unsupervised methods.
1. Supervised methods:

• These methods use labeled data to identify outliers.

• For example, a supervised outlier detection algorithm may use a decision tree
or a random forest to classify data points as outliers or non-outliers based on
the features of the data.
• Decision Tree:
• A tree-like model that splits the data into branches to make
decisions based on labeled training data.
• A decision tree is a series of yes/no questions that help us
sort and group data.
• Each question splits the data into smaller and smaller
groups based on the answers.
• Classifies data points as outliers or non-outliers based on
learned rules from the training set.
• Imagine you have a small set of numbers, which represent the ages of
people in a group:
ages = [5, 6, 6, 7, 8, 9, 10, 100]

How Can a Decision Tree Detect Outliers?

• When we use a decision tree to look at our data, it tries to group similar
ages together.
Step 1: Sort the Ages
sorted_ages = [5, 6, 6, 7, 8, 9, 10, 100]
Step 2: Create Questions to Split the Data
A decision tree will ask questions like:
Is the age less than 6.5? (Splits into [5, 6, 6] and [7, 8, 9, 10, 100])
Is the age less than 8.5? (Splits into [5, 6, 6, 7, 8] and [9, 10, 100])
Is the age less than 50? (Splits into [5, 6, 6, 7, 8, 9, 10] and [100])
Step 3: Look at the Groups
• After splitting the data, we end up with groups (also called leaves) of
ages:
• Group 1: [5, 6, 6]
• Group 2: [7, 8, 9, 10]
• Group 3: [100]
Step 4: Identify Outliers
• Outliers are in groups with very few data points.
• In this case, the age 100 ends up alone in its own group, making it an outlier.
• By looking at which data points end up in small or unique groups, we can detect outliers using a decision tree.
Dataset:
•A list of dictionaries containing Size and Price values, with some obvious outliers.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

data = [
    {"Size": 1500, "Price": 300},
    {"Size": 1600, "Price": 320},
    {"Size": 1700, "Price": 340},
    {"Size": 1800, "Price": 360},
    {"Size": 1900, "Price": 380},
    {"Size": 2000, "Price": 400},
    {"Size": 2500, "Price": 1000},  # Outlier
    {"Size": 2600, "Price": 1050},  # Outlier
    {"Size": 1700, "Price": 30},    # Outlier
]
• Convert the dataset into NumPy arrays for easier manipulation and feeding into the model.
# Convert the data into numpy arrays
X = np.array([[d["Size"]] for d in data])
y = np.array([d["Price"] for d in data])
• Use the DecisionTreeRegressor to fit the model to the data.
# Create and fit the decision tree model
# The depth is limited here because an unconstrained tree can memorize the
# training data, which would leave almost no residuals to inspect
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)
• Use the trained model to predict prices based on Size.
# Predict prices using the trained model
predictions = model.predict(X)
•Compute the absolute differences between actual prices and predicted prices.
# Calculate residuals (absolute differences between actual and predicted prices)
residuals = np.abs(predictions - y)
• Determine outliers based on a set threshold for residuals (e.g., 100). Which points are flagged depends on both the tree depth and this threshold.
# Set a threshold for detecting outliers (e.g., residual > 100)
threshold = 100
outliers = np.where(residuals > threshold)[0]
•Print the indices and details of detected outliers.
# Print out the detected outliers
print("Detected outliers at indices:", outliers)
for index in outliers:
    print(data[index])
•Plot the actual data, predicted prices, and highlight the outliers.
# Plot the results (sort by Size so the prediction line is drawn left to right)
order = np.argsort(X[:, 0])
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X[order], predictions[order], color='green', label='Predicted Prices', linewidth=2)
plt.scatter(X[outliers], y[outliers], color='red', label='Outliers', marker='x', s=100)
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (in thousands)')
plt.legend()
plt.title('Outlier Detection using Decision Trees')
plt.show()
• Random Forest:
• Outlier detection using Random Forests can be effectively demonstrated through the concept of the Isolation Forest, which is specifically designed for anomaly detection.
• Unlike traditional Random Forests used for classification or regression, the Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
• Note that the Isolation Forest itself does not need labeled data.
Isolation Forest:
• A tree-based algorithm that identifies outliers by isolating them from the rest of the
data.
Step 1: Import Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

Step 2: Create a Simple Dataset

data = {
    'house_price': [200000, 250000, 300000, 350000, 400000, 450000, 500000, 550000, 600000, 10000000]  # Note the outlier
}
df = pd.DataFrame(data)

Step 3: Visualize the Data

plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['house_price'], color='blue')
plt.xlabel('Index')
plt.ylabel('House Price')
plt.title('House Prices Scatter Plot')
plt.show()
Step 4: Apply Isolation Forest

# Initialize the Isolation Forest model
# Here, we assume 10% of the data could be outliers; random_state makes the result repeatable
iso_forest = IsolationForest(contamination=0.1, random_state=42)

# Fit the model to the data
iso_forest.fit(df[['house_price']])

# Predict the outliers
df['outlier'] = iso_forest.predict(df[['house_price']])
# The model assigns -1 to outliers and 1 to inliers
Step 5: Visualize the Outliers

plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['house_price'], color='blue', label='Inliers')
plt.scatter(df[df['outlier'] == -1].index, df[df['outlier'] == -1]['house_price'], color='red',
label='Outliers')
plt.xlabel('Index')
plt.ylabel('House Price')
plt.title('Outlier Detection in House Prices using Isolation Forest')
plt.legend()
plt.show()

Step 6: Examine the Results

print(df)
2. Semi-supervised methods:

• These methods use a combination of labeled and unlabeled data to identify outliers.
• For example, a semi-supervised outlier detection algorithm may use clustering
to group similar data points together and then use the labeled data to identify
outliers within the clusters.
Outliers within the Clusters
Example: Let's use a simple dataset with labeled and unlabeled points.
• We'll apply a clustering algorithm to group similar data points and then use labeled data to
detect outliers within these clusters.

Steps:

• Create the Dataset: We'll create a dataset with labeled (normal and outlier) and unlabeled
points.

• Cluster the Data: Use a clustering algorithm (e.g., KMeans) to group similar data points.

• Identify Outliers within Clusters: Use the labeled data to identify which points within each
cluster are outliers.
Steps:
1.Dataset:
• Contains labeled points (normal and outlier) and unlabeled points.
2.Clustering:
• Use KMeans to cluster the data into two groups.
3.Calculate Distances:
• Compute the distances of labeled points to their respective cluster centers.
4.Set Threshold:
• Determine a threshold for identifying outliers (e.g., mean + 2*std of distances).
5.Identify Outliers:
• Flag points with distances greater than the threshold as outliers.
6.Plot the Results:
• Visualize the clusters, labeled data, and detected outliers.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample dataset with labeled (1: normal, -1: outlier) and unlabeled (0: unknown) points
data = [
    {"Size": 1500, "Price": 300, "Label": 1},
    {"Size": 1600, "Price": 320, "Label": 1},
    {"Size": 1700, "Price": 340, "Label": 1},
    {"Size": 1800, "Price": 360, "Label": 1},
    {"Size": 1900, "Price": 380, "Label": 1},
    {"Size": 2000, "Price": 400, "Label": 1},
    {"Size": 2500, "Price": 1000, "Label": -1},  # Outlier
    {"Size": 2600, "Price": 1050, "Label": -1},  # Outlier
    {"Size": 1700, "Price": 30, "Label": -1},    # Outlier
    {"Size": 1750, "Price": 350, "Label": 0},    # Unlabeled
    {"Size": 1650, "Price": 310, "Label": 0},    # Unlabeled
]

# Convert the data into numpy arrays
X = np.array([[d["Size"], d["Price"]] for d in data])
labels = np.array([d["Label"] for d in data])

# Apply KMeans clustering (n_init is set explicitly so the result does not depend on the library default)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
clusters = kmeans.predict(X)

# Separate labeled data
labeled_mask = labels != 0
labeled_data = X[labeled_mask]
labeled_labels = labels[labeled_mask]

# Calculate distances of labeled points to their cluster centers
# Note: the features are not standardized, so the distances depend on the units of Size and Price
distances = np.linalg.norm(labeled_data - kmeans.cluster_centers_[clusters[labeled_mask]], axis=1)

# Set a threshold for detecting outliers (e.g., mean + 2*std of distances)
threshold = np.mean(distances) + 2 * np.std(distances)
outliers = labeled_data[distances > threshold]

# Print out the detected outliers
print("Detected outliers within clusters:")
for outlier in outliers:
    print(outlier)

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', marker='o', label='All points (colored by cluster)')
plt.scatter(labeled_data[:, 0], labeled_data[:, 1], c=labeled_labels, cmap='coolwarm', marker='x', label='Labeled Data')
plt.scatter(outliers[:, 0], outliers[:, 1], color='red', marker='s', s=100, label='Detected Outliers')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (in thousands)')
plt.legend()
plt.title('Semi-Supervised Outlier Detection using Clustering')
plt.show()
Output:
Detected outliers within clusters:
3. Unsupervised methods:

• These methods use only unlabeled data to identify outliers.

• For example, unsupervised outlier detection methods can use density-based or distance-based methods to identify data points that are far away from the rest of the data.

• Some popular unsupervised methods include the Local Outlier Factor (LOF),
k-nearest neighbor (KNN) based method, DBSCAN.
1. Local Outlier Factor (LOF)
• LOF measures the local density deviation of a data point compared to its
neighbors.
• Points with significantly lower density than their neighbors are
considered outliers.
2. k-Nearest Neighbors (k-NN)
• The k-NN method calculates the distance of a data point to its k-nearest
neighbors.
• Points far from their neighbors are flagged as outliers.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN groups data points into clusters based on density.
• Points that do not fit into any cluster are considered outliers (noise).
4. One-Class SVM
• One-Class Support Vector Machine (SVM) tries to separate the normal
data points from the outliers by finding a hyperplane that best encloses
the majority of the data points.
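As a rough sketch of the distance-based idea behind the k-NN method, the NumPy code below scores each point by the distance to its k-th nearest neighbor. The house prices are illustrative, and both k = 3 and the mean + 2 × std cutoff are assumptions chosen for this example rather than fixed parts of the method.

```python
import numpy as np

# Illustrative house prices with one extreme value
prices = np.array([200000, 250000, 300000, 350000, 400000,
                   450000, 500000, 550000, 600000, 10000000], dtype=float)
X = prices.reshape(-1, 1)   # one feature per row
k = 3                       # number of neighbors to look at (assumed)

# Pairwise Euclidean distances between all points
diffs = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diffs ** 2).sum(axis=-1))

# After sorting each row, column 0 is the distance of a point to itself (0),
# so column k holds the distance to its k-th nearest neighbor
kth_distance = np.sort(dist, axis=1)[:, k]

# Flag points whose k-th neighbor is unusually far away (assumed cutoff)
threshold = kth_distance.mean() + 2 * kth_distance.std()
outliers = prices[kth_distance > threshold]
print(outliers)   # → [10000000.]
```

Computing all pairwise distances is fine for small data sets but grows quadratically with the number of points, which is one reason the computational cost has to be weighed against accuracy.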
• In addition to these three main categories, there are also other methods for
outlier detection, such as
• ensemble methods that combine multiple methods or
• deep learning-based methods that use neural networks to identify outliers.

• It is necessary to note that the method for outlier detection will depend on
the specific characteristics of the data and the problem at hand.

• It’s also important to consider the trade-off between computational cost and the accuracy of outlier detection.
Sorting the Data for Outlier Detection
• Sorting data is a fundamental step in data analysis that helps organize
the dataset, making it easier to identify outliers and discern patterns.

1. Sorting in Ascending/Descending Order

2. Grouping by Categories
Sorting in Ascending/Descending Order
Numerical Order:
1. Ascending Order: Sorting numerical data from the smallest to the largest value.
2. Descending Order: Sorting numerical data from the largest to the smallest value.
How It Helps:
• Highlighting Outliers: By sorting data, outliers become more apparent. For
example, if most values are clustered within a specific range, but a few values
are significantly higher or lower, these outliers will stand out at the ends of
the sorted list.
• Easier Visualization: Sorted data makes it easier to create visualizations, such
as box plots and scatter plots, where outliers and patterns can be more easily
identified.
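A small pandas sketch of this idea: after sorting, suspicious values collect at the two ends of the column. The column name age and the values are made up for illustration, echoing the typing-error example of an age recorded as 250.

```python
import pandas as pd

# Illustrative ages; 250 is a likely data entry error
df = pd.DataFrame({"age": [25, 31, 250, 28, 22, 35, 29]})

# Ascending order: smallest values first, largest values last
sorted_df = df.sort_values("age").reset_index(drop=True)

print(sorted_df.head(3))   # lowest values
print(sorted_df.tail(3))   # highest values, where 250 stands out

# Descending order puts the largest values at the top instead
sorted_desc = df.sort_values("age", ascending=False)
```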
Grouping by Categories
Categorical Data:
• Sorting categorical data involves organizing the data into groups based on categories.
How It Helps:
• Analyzing Each Group Separately: By sorting and grouping data by
categories, you can analyze each group independently. This makes it
easier to identify anomalies or patterns within each group.
• Comparing Groups: Sorting data by categories allows for comparison
between different groups, which can help identify any discrepancies or
outliers in specific categories.
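Grouping can be sketched in pandas as below. The department and salary columns are hypothetical, chosen only to show how a per-group summary exposes a value that is extreme for its own group.

```python
import pandas as pd

# Hypothetical data: salaries (in thousands) for two departments
df = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B", "B"],
    "salary": [40, 42, 41, 55, 57, 250],
})

# Sort within each category, then summarize each group separately
ordered = df.sort_values(["department", "salary"])
summary = ordered.groupby("department")["salary"].agg(["min", "median", "max"])
print(summary)
```

In the summary, a maximum that sits far from the median of its own group is a candidate outlier worth inspecting.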
Drawing Graphs/ Plots
Visualizing data with graphs and plots is an effective way to detect and understand outliers:
1. Box Plot: A box plot shows the distribution of data and highlights outliers as points outside the whiskers.
2. Scatter Plot: A scatter plot displays individual data points, helping identify outliers in two-dimensional data.
3. Histogram: A histogram shows the frequency distribution of data, with outliers appearing as bars separated from the main cluster.
How to Handle Outliers
• Trimming refers to the process of removing a specified percentage of the highest and
lowest values in a dataset.

• Trimming Process
• Define the Trimming Percentage:
• Decide how much data you want to trim from the top and bottom ends of the
distribution.
• Common choices are 1%, 5%, or 10%, depending on the dataset and the context.
• Sort the Data:
• Sort the dataset in ascending order to easily identify the highest and lowest values.
• Determine the Cutoff Points:
• Calculate the indices for the values to be removed based on the trimming percentage.
• For example, if you decide to trim 5% from both ends of a dataset of 1000
observations, you would remove the lowest 50 and highest 50 values.
• Remove the Outliers:
• Exclude the determined highest and lowest values from your dataset.
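The trimming steps above can be sketched in plain Python. The values and the 5% trimming percentage are illustrative assumptions.

```python
# Illustrative values with one very high and one very low reading
values = [12, 15, 14, 10, 13, 200, 11, 16, 14, 13,
          12, 15, -40, 14, 13, 12, 16, 15, 11, 14]

trim_percent = 5                   # percentage to remove from EACH end (assumed)

sorted_values = sorted(values)     # sort in ascending order
n = len(sorted_values)
cut = n * trim_percent // 100      # number of values to drop at each end

# Keep everything between the two cutoff points
trimmed = sorted_values[cut:n - cut]
print(len(values), "->", len(trimmed))
```

With 20 values and 5% trimming, one value is removed from each end, which here drops -40 and 200.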
Considerations of Trimming:
•Trimming can improve model performance by reducing the influence of extreme values, but it may also lead to loss of valuable information.
•Ensure that the trimmed values are genuinely outliers rather than important data points.
•Trimming is a more aggressive approach compared to capping or winsorizing, as it completely removes data rather than adjusting its value.
•It’s helpful to visualize the data before and after trimming (using histograms or box plots) to understand the impact of this method.
• Capping, also known as Winsorizing, is a method for handling outliers by replacing
extreme values with the nearest value within a specified range.
• This approach helps mitigate the impact of outliers on statistical analysis without
completely removing data points.
Capping Process
• Determine Capping Percentiles:
• Decide on the percentile thresholds for capping.
• Common choices are the 1st and 99th percentiles, or the 5th and 95th percentiles,
depending on how aggressive you want the capping to be.
• Calculate the Cutoff Values:
• Compute the values at the specified percentiles. For example, if using the 5th and
95th percentiles, you would find these values in your dataset.
• Replace Outliers:
• For values below the lower percentile (e.g., 5th), replace them with the value at the
lower percentile.
• For values above the upper percentile (e.g., 95th), replace them with the value at the
upper percentile.
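A minimal NumPy sketch of the capping steps, assuming the 5th and 95th percentiles as thresholds and the same kind of illustrative values used for trimming:

```python
import numpy as np

# Illustrative values with one very high and one very low reading
values = np.array([12, 15, 14, 10, 13, 200, 11, 16, 14, 13,
                   12, 15, -40, 14, 13, 12, 16, 15, 11, 14], dtype=float)

# Cutoff values at the chosen percentiles (assumed: 5th and 95th)
lower_cap, upper_cap = np.percentile(values, [5, 95])

# Values below lower_cap become lower_cap; values above upper_cap become upper_cap
capped = np.clip(values, lower_cap, upper_cap)

print(len(values), len(capped))     # no data points are removed
print(capped.min(), capped.max())   # extremes are pulled in to the caps
```

Unlike trimming, the array keeps its original length and only the extreme values change.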
Considerations of Capping:
•Capping allows you to retain all data points, which can be beneficial for analyses that require a full dataset.
•While capping reduces the influence of extreme values, it may introduce bias if the capped values represent valid extreme cases.
•It’s useful to visualize the data before and after capping (using box plots or histograms) to see the effect on the distribution.
Discretization
• It is the process of turning continuous data, like
heights or temperatures, into distinct categories or
groups.
• For example, instead of measuring height as a
precise number (like 5.4 feet), you might group it
into categories like "short," "average," and "tall."
• This helps simplify the data, making it easier to
analyze and understand.
Methods of Discretization
1. Equal Width Binning:
• The range of the continuous variable is divided into a specified number
of equal-width intervals.
• Example: If the range is from 0 to 100 and you want 5 bins, each bin would cover a width of 20 (0-20, 20-40, 40-60, 60-80, 80-100).

2. Equal Frequency Binning (Quantile Binning):

• The data is divided into bins such that each bin contains approximately
the same number of observations.
• Example: If you have 100 data points and want 4 bins, each bin will
contain 25 data points.
3. K-means Clustering:
•Use clustering algorithms like K-means to group data points into clusters,
then assign each point to a cluster (bin).
•This method can adapt to the distribution of the data rather than relying on
fixed intervals.
•Let's say we have a dataset of house prices and we want to discretize the
prices into bins.

Apply K-means Clustering

•Choose the number of clusters, k=3.
•Apply K-means clustering to partition the prices into 3 clusters.
4. Decision Tree-Based Binning:
•Decision trees can be used to determine optimal cut points for discretization
based on the target variable.
•This method creates bins that maximize the separation of different classes or
values.
•Let's consider a dataset for predicting house prices with a continuous
feature, "square footage," and the target variable, "price range" (e.g., low,
medium, high).
•Determine optimal cut points :
• The decision tree might determine that houses with square footage
• below 1000 tend to fall into the "low" price range,
• between 1000 and 2000 into the "medium" price range, and
• above 2000 into the "high" price range.
5.Custom Binning:
•Define your own bins based on domain knowledge or specific requirements
of the analysis.
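Equal width, equal frequency, and custom binning can be sketched with pandas as below. The scores, the heights, and the custom cut points for "short", "average", and "tall" are all illustrative assumptions, not recommended values.

```python
import pandas as pd

scores = pd.Series([0, 10, 20, 35, 50, 65, 80, 100])

# Equal width binning: 5 intervals of the same width over the observed range
equal_width = pd.cut(scores, bins=5, labels=False)

# Equal frequency (quantile) binning: 4 bins with about the same number of points
equal_freq = pd.qcut(scores, q=4, labels=False)

# Custom binning: cut points chosen from domain knowledge (hypothetical values, in feet)
heights = pd.Series([4.9, 5.4, 5.8, 6.3])
height_group = pd.cut(heights, bins=[0, 5.2, 5.9, 8],
                      labels=["short", "average", "tall"])

print(equal_width.tolist())
print(equal_freq.value_counts().sort_index().tolist())
print(height_group.tolist())
```

Here labels=False returns the bin number for each value. Passing a list of labels, as in the custom example, returns named categories instead.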
Considerations:
•The method of discretization can significantly impact the performance of machine learning models. Choose a method that aligns with the nature of the data and the analysis goals.
•The number of bins should be carefully selected; too few bins can oversimplify the data, while too many bins can lead to overfitting.
•Discretization may lead to a loss of information, so it’s essential to evaluate the trade-offs between simplicity and accuracy.

No ratings yet
AI Beyond Classical Search &CSPs
116 pages
Unit-3 AI Propositional Logic
No ratings yet
Unit-3 AI Propositional Logic
124 pages
Unit 1 Ai, ML, Ann, DL and Examples
No ratings yet
Unit 1 Ai, ML, Ann, DL and Examples
68 pages
Artificial Neural Networks
No ratings yet
Artificial Neural Networks
23 pages
5 Forecasting PDF
No ratings yet
5 Forecasting PDF
24 pages
Tutorial
No ratings yet
Tutorial
28 pages
B.Tech Naval Architecture Syllabus
0% (1)
B.Tech Naval Architecture Syllabus
95 pages
Alice Nguni CV 3
No ratings yet
Alice Nguni CV 3
2 pages
Science Quarter 1 Module 1
No ratings yet
Science Quarter 1 Module 1
33 pages
2007e Kempfert Becker - Empirical Axial Resistences Sheet Piles
No ratings yet
2007e Kempfert Becker - Empirical Axial Resistences Sheet Piles
7 pages
DATA9001 Assignment1 2025 Questions
No ratings yet
DATA9001 Assignment1 2025 Questions
3 pages
Engineering Statistics: Probability Basics
No ratings yet
Engineering Statistics: Probability Basics
12 pages
2025 LRR
No ratings yet
2025 LRR
29 pages
Presentation 3
100% (1)
Presentation 3
37 pages
Factors Affecting Cost Overruns in Construction Projects in KENHA
No ratings yet
Factors Affecting Cost Overruns in Construction Projects in KENHA
16 pages
MB0050-Research Methodology PDF
100% (2)
MB0050-Research Methodology PDF
217 pages
Your Toddler Month by Month 1st Edition Tanya Byron Download
No ratings yet
Your Toddler Month by Month 1st Edition Tanya Byron Download
90 pages
AP Statistics Chapter 5 Test Review Sheet
No ratings yet
AP Statistics Chapter 5 Test Review Sheet
2 pages
CUSUM Test for Steel I-Beam Quality
No ratings yet
CUSUM Test for Steel I-Beam Quality
4 pages
A Comparative Analysis of 100-Meter Sprint Among Gen-Z Regional Non-Athletics and Non-Athletes in Ilocos Norte
No ratings yet
A Comparative Analysis of 100-Meter Sprint Among Gen-Z Regional Non-Athletics and Non-Athletes in Ilocos Norte
4 pages
MID Sampling Plans Guide
No ratings yet
MID Sampling Plans Guide
16 pages
Advances and Opportunities in Process Data Analytics. - 1
No ratings yet
Advances and Opportunities in Process Data Analytics. - 1
9 pages
Marital Adjustment and Life Satisfaction PDF
No ratings yet
Marital Adjustment and Life Satisfaction PDF
9 pages
Eapp Q2 W1
50% (2)
Eapp Q2 W1
37 pages
Chapter6 ANOVA
No ratings yet
Chapter6 ANOVA
51 pages
Central Limit Theorem
No ratings yet
Central Limit Theorem
32 pages
Descriptive Statistics (II)
No ratings yet
Descriptive Statistics (II)
14 pages
MSC Chemistry 2sem Course 2. 4
No ratings yet
MSC Chemistry 2sem Course 2. 4
321 pages
BBA Syllabus 1st, 2nd, 3rd & 4th Sem
50% (2)
BBA Syllabus 1st, 2nd, 3rd & 4th Sem
46 pages
Cisco Press Computer Networking Data Analytics Developing in
No ratings yet
Cisco Press Computer Networking Data Analytics Developing in
469 pages
Netflix Prize: All Together Now: A Perspective On The
No ratings yet
Netflix Prize: All Together Now: A Perspective On The
1 page
Descartes g2 Utilizing Platic Waste As An Additives On Cement For Bricks Production, Tapos Na
No ratings yet
Descartes g2 Utilizing Platic Waste As An Additives On Cement For Bricks Production, Tapos Na
44 pages
H Ho Telling
100% (1)
H Ho Telling
16 pages
Algebra 2 Chapter 11 Notes
No ratings yet
Algebra 2 Chapter 11 Notes
24 pages