
UNIT-2

MLDS
Contents:
• Exploratory Data Analysis (EDA):
• The process of EDA,
• Knowing Initial Details about Data,
• Modifying or Removing Unwanted Data,
• Retrieving Data,
• Getting Statistical Data,
• Drawing Graphs/ Plots.
• Outliers:
• Causes of Outliers,
• Detecting the Outliers,
• Sorting the Data,
• Drawing Graphs/ Plots,
• Inter Quartile Range (IQR) Method,
• How to Handle Outliers
• Exploratory Data Analysis (EDA):
• The process of EDA,
• Knowing Initial Details about Data,
• Modifying or Removing Unwanted Data,
• Retrieving Data,
• Getting Statistical Data,
• Drawing Graphs/ Plots.
Exploratory Data Analysis (EDA)
• It is a crucial step in the data analysis process.
• It involves summarizing the main characteristics of the data, often
with visual methods.
• EDA is an iterative process that involves:
1.Understanding the initial structure and details of the data.
2.Cleaning and preprocessing the data.
3.Extracting and focusing on relevant subsets of data.
4.Computing statistical summaries.
5.Visualizing the data for better insights.
1. Knowing Initial Details about Data:
• Understand the basic structure and attributes of the dataset.
Steps:
•Load the dataset.
•View the first few rows.
•Check the data types and non-null values.
•Get a statistical summary of the dataset.

import pandas as pd

# Load dataset
df = pd.read_csv('your_dataset.csv')

# View the first few rows
print(df.head())

# Check data types and non-null values
print(df.info())

# Get a statistical summary (count, mean, std, quartiles, etc.)
print(df.describe())
2. Modifying or Removing Unwanted Data:
• Clean the data by handling missing values, outliers, and irrelevant features.
Steps:
•Identify and handle missing values.
•Remove duplicate rows.
•Drop irrelevant columns.

# Handling missing values
df = df.dropna()  # Drop rows with missing values
# (or)
df = df.ffill()   # Forward fill to handle missing values

# Remove duplicate rows
df = df.drop_duplicates()

# Drop irrelevant columns
df = df.drop(['column_name1', 'column_name2'], axis=1)
3. Retrieving Data:
• Extract specific subsets of data for focused analysis.
Steps:
•Filter rows based on conditions.
•Select specific columns.
•Group data by categorical variables.

# Filter rows based on conditions
filtered_df = df[df['column_name'] > value]

# Select specific columns
selected_columns = df[['column1', 'column2']]

# Group data by categorical variables (mean of numeric columns)
grouped_df = df.groupby('category_column').mean(numeric_only=True)
4. Getting Statistical Data:
• Calculate summary statistics to understand the data distribution.
Steps:
•Compute basic statistics (mean, median, mode, etc.).
•Calculate variance and standard deviation.
•Identify the correlation between variables.

# Basic statistics
mean_value = df['column_name'].mean()
median_value = df['column_name'].median()
mode_value = df['column_name'].mode()

# Variance and standard deviation
variance = df['column_name'].var()
std_deviation = df['column_name'].std()

# Correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
5. Drawing Graphs/Plots
Objective: Visualize data to identify patterns, trends, and
outliers.
Steps:
•Create histograms for distribution.
•Use box plots for identifying outliers.
•Plot scatter plots for relationship between variables.
•Draw line plots for time series data.
•Generate heatmaps for correlation matrices.
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram
df['column_name'].hist()
plt.show()

# Box plot
df.boxplot(column='column_name')
plt.show()

# Scatter plot
plt.scatter(df['column1'], df['column2'])
plt.xlabel('column1')
plt.ylabel('column2')
plt.show()

# Line plot (time series)
df['column_name'].plot()
plt.show()

# Heatmap for correlation matrix
sns.heatmap(correlation_matrix, annot=True)
plt.show()
• Outliers:
• Causes of Outliers,
• Detecting the Outliers,
• Sorting the Data,
• Drawing Graphs/ Plots,
• Inter Quartile Range (IQR) Method,
• How to Handle Outliers
• An outlier is a data point significantly different from the other data points in a dataset.
• Outliers can significantly distort analysis and the interpretation of data in machine learning, so it is essential to detect them.
Causes of Outliers
1. Measurement Errors: Mistakes during data collection or recording
can introduce outliers.
Ex:
• Suppose a researcher is measuring the height of students in a school.
• If the measuring tape is not used correctly, say it’s not held straight, it
could result in a height reading that is significantly higher or lower than
the actual height.
• This incorrect measurement can introduce outliers in the dataset.
Causes of Outliers
2. Data Entry Errors: Typographical errors or incorrect data entries can
create outliers.
Example:
• When entering data manually into a computer system, a typographical
error might occur.
• For instance, if a person's age is recorded as 250 instead of 25 due to a
typing error, this will appear as an outlier in the age data of the
population.
Causes of Outliers
3. Experimental Errors: Issues with experimental procedures or equipment can result in outliers.
Example:
• In a chemistry experiment, if the balance used to weigh chemicals is not
calibrated properly, it could give incorrect readings.
• For example, if a reading of 10 grams is recorded for a substance that actually weighs 12 grams because of a faulty balance, this erroneous value will appear as an outlier compared with other, correct measurements.
Causes of Outliers
4. Natural Variability: Some outliers naturally occur due to the inherent
variability in data.

Example:
• In biological data, there might be a natural occurrence of outliers due to
genetic mutations or other factors.
• For instance, in a dataset measuring human heights, someone with
gigantism (a condition causing excessive growth) might be an outlier
due to their unusually tall stature compared to the average population.
Causes of Outliers
5. Sampling Errors: When a sample does not represent the population
well, outliers can emerge.
Example:
• If a survey on household incomes is conducted in a wealthy
neighborhood only, the results will not represent the general
population accurately.
• This skewed sampling can result in outliers when the data is compared
with a more representative sample of the population.
Detecting Outliers
• Detecting outliers is crucial because they can distort the overall
picture of the data and lead to incorrect conclusions if not
appropriately handled.
• Outliers can also affect the performance of many machine learning
models, as they can skew the results and lead to overfitting or poor
generalization.
• Thus, detecting outliers is essential for cleaning and preparing the
data for analysis and ensuring the results’ validity.
Detecting Outliers-Visual Inspection
• Plotting the data, such as using scatter plots or histograms, can help you
visually identify outliers that stand out from the rest of the data.

• Box Plots: A box plot displays data distribution and highlights outliers outside the "whiskers".

• Scatter Plots: These plots show individual data points and help identify
outliers in the context of two variables.
Box Plots
• A box plot (also known as a whisker plot) is a standardized way
of displaying the distribution of data based on a five-number
summary.
• Five-Number Summary:
• Minimum: The smallest data point excluding outliers.
• First Quartile (Q1): The median of the lower half of the dataset.
• Median: The middle value of the dataset.
• Third Quartile (Q3): The median of the upper half of the dataset.
• Maximum: The largest data point excluding outliers.
• Box Plot Structure:
• Box: The box represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). It contains the middle 50% of the data.
• Whiskers: Lines extending from the box to the smallest and largest data points within 1.5 times the IQR from the first and third quartiles.
• Outliers: Data points outside the whiskers are considered outliers and are plotted as individual points.
• In the data set 1, 2, 3, 4, 5, 11, 11, 12, 14, 20 (n = 10), the position of the first quartile is (n + 1)/4 = 11/4 = 2.75.
• This means the first quartile lies at position 2.75 in the sorted data, i.e., between the 2nd and 3rd values; interpolating, Q1 = 2 + 0.75 × (3 - 2) = 2.75.
Example:
• If you have a dataset of exam scores, a box plot can quickly show you
the overall distribution of scores, the median score, and any unusually
high or low scores (outliers).
Scatter Plots
• A scatter plot is a type of data visualization that displays individual data
points on a two-dimensional graph.
• Scatter plots often have a pattern.
• We call a data point an outlier if it doesn't fit the pattern.
• It helps in identifying relationships between two variables and spotting
outliers.
• Data Points: Each point on the scatter plot represents an individual
observation with its position determined by the values of the two
variables.
Example:
• Imagine you are analyzing the relationship between study hours and exam scores.
• A scatter plot will show you how exam scores change with different study hours.
• Points that are far away from the general cluster of data might indicate outliers
(e.g., a student who studied very little but scored very high or vice versa).
• Box Plots are useful for summarizing data distributions and spotting outliers in a single variable.
• Scatter Plots help visualize relationships between two variables and identify outliers within the context of this relationship.
Statistical Methods for Detecting
Outliers
Z-Scores
• A Z-score is a statistical measure that describes how far a data point is
from the mean of a dataset, in terms of standard deviations.
• It helps in identifying outliers by showing how unusual a data point is
compared to the rest of the data.
Interpreting Z-Scores:
• A Z-score of 0 means the data point is exactly at the mean.
• Positive Z-scores indicate data points above the mean.
• Negative Z-scores indicate data points below the mean.
• Typically, a Z-score above 3 or below -3 is considered an outlier.
• This means the data point is more than 3 standard deviations away from the mean, which is quite unusual.
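A minimal sketch of Z-score detection in pandas (assuming, as in the earlier examples, a DataFrame df with a numeric column 'column_name'):

# Z-score: distance from the mean in units of standard deviation
z_scores = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()

# Flag points more than 3 standard deviations from the mean
outliers = df[z_scores.abs() > 3]
print(outliers)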
Inter Quartile Range (IQR) Method
• The Inter Quartile Range (IQR) method is a robust statistical approach for
detecting outliers based on the spread of the middle 50% of the data.
1. Calculation of IQR:
• First Quartile (Q1): The median of the lower half of the dataset (25th
percentile).
• Third Quartile (Q3): The median of the upper half of the dataset (75th
percentile).
• IQR: The range between Q1 and Q3:
IQR=Q3−Q1
2. Identifying Outliers:
• Data points that fall below Q1 − 1.5×IQR or above Q3 + 1.5×IQR are considered outliers.
• This rule is based on the idea that most of the data should lie within 1.5 times the IQR from the quartiles.
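As a minimal sketch (again assuming a numeric column 'column_name' in a DataFrame df), the IQR rule translates directly into pandas:

Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Rows outside [lower_bound, upper_bound] are flagged as outliers
outliers = df[(df['column_name'] < lower_bound) | (df['column_name'] > upper_bound)]
print(outliers)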
• Z-Scores
• provide a way to determine how far a data point is from the mean, in terms of
standard deviations.
• Typically, Z-scores above 3 or below -3 indicate outliers.
• IQR Method
• uses the spread of the middle 50% of the data to define a range.
• Data points outside 1.5 times the IQR from the first and third quartiles are
considered outliers.
• Both methods are effective for identifying outliers and are widely used
in data analysis to ensure accurate results and insights.
Outlier Detection in Machine Learning
• There are various outlier detection techniques in machine learning, categorized as supervised methods, semi-supervised methods, and unsupervised methods.
1. Supervised methods:

• These methods use labeled data to identify outliers.

• For example, a supervised outlier detection algorithm may use a decision tree
or a random forest to classify data points as outliers or non-outliers based on
the features of the data.
• Decision Tree:
• A tree-like model that splits the data into branches to make
decisions based on labeled training data.
• A decision tree is a series of yes/no questions that help us
sort and group data.
• Each question splits the data into smaller and smaller
groups based on the answers.
• Classifies data points as outliers or non-outliers based on
learned rules from the training set.
• Imagine you have a small set of numbers, which represent the ages of
people in a group:
ages = [5, 6, 6, 7, 8, 9, 10, 100]

How Can a Decision Tree Detect Outliers?
• When we use a decision tree to look at our data, it tries to group similar ages together.
Step 1: Sort the Ages
sorted_ages = [5, 6, 6, 7, 8, 9, 10, 100]
Step 2: Create Questions to Split the Data
A decision tree will ask questions like:
Is the age less than 6.5? (Splits into [5, 6, 6] and [7, 8, 9, 10, 100])
Is the age less than 8.5? (Splits into [5, 6, 6, 7, 8] and [9, 10, 100])
Is the age less than 50? (Splits into [5, 6, 6, 7, 8, 9, 10] and [100])
Step 3: Look at the Groups
• After splitting the data, we end up with groups (also called leaves) of
ages:
• Group 1: [5, 6, 6]
• Group 2: [7, 8, 9, 10]
• Group 3: [100]
Step 4: Identify Outliers
• Outliers are in groups with very few data points.
• In this case, the group [100] has only one data point, making it an
outlier.
• The age 100 ends up alone in its own group, making it an outlier.
• By looking at which data points end up in small or unique groups, we can detect outliers using
a decision tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Dataset: a list of dictionaries containing Size and Price values,
# with some obvious outliers
data = [
    {"Size": 1500, "Price": 300},
    {"Size": 1600, "Price": 320},
    {"Size": 1700, "Price": 340},
    {"Size": 1800, "Price": 360},
    {"Size": 1900, "Price": 380},
    {"Size": 2000, "Price": 400},
    {"Size": 2500, "Price": 1000},  # Outlier
    {"Size": 2600, "Price": 1050},  # Outlier
    {"Size": 1700, "Price": 30},    # Outlier
]
# Convert the dataset into numpy arrays for easier manipulation
# and feeding into the model
X = np.array([[d["Size"]] for d in data])
y = np.array([d["Price"] for d in data])

# Create and fit the decision tree model
# (Note: an unconstrained regression tree memorizes most training points
# exactly, giving near-zero residuals; in practice, limiting the depth,
# e.g., via max_depth, makes residual-based detection more meaningful.)
model = DecisionTreeRegressor()
model.fit(X, y)

# Use the trained model to predict prices based on Size
predictions = model.predict(X)
# Calculate residuals (absolute differences between actual and predicted prices)
residuals = np.abs(predictions - y)

# Set a threshold for detecting outliers (e.g., residual > 100)
threshold = 100
outliers = np.where(residuals > threshold)[0]

# Print the indices and details of the detected outliers
print("Detected outliers at indices:", outliers)
for index in outliers:
    print(data[index])

# Plot the results

•Plot the actual data, plt.scatter(X, y, color='blue', label='Actual Data’)


predicted prices, and plt.plot(X, predictions, color='green', label='Predicted Prices',
linewidth=2)
highlight the outliers. plt.scatter(X[outliers], y[outliers], color='red', label='Outliers',
marker='x', s=100)
plt.xlabel('Size (sq ft)’)
plt.ylabel('Price (in thousands)’)
plt.legend()
plt.title('Outlier Detection using Decision Trees’)
plt.show()
• Output:
• Random Forest:
• Outlier detection using Random Forests can be effectively demonstrated through the concept of the Isolation Forest, which is specifically designed for anomaly detection.
• Unlike traditional Random Forests used for classification or regression, the Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.
Isolation Forest:
• A tree-based algorithm that identifies outliers by isolating them from the rest of the
data.
Step 1: Import Necessary Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

Step 2: Create a Simple Dataset

data = {
    'house_price': [200000, 250000, 300000, 350000, 400000, 450000,
                    500000, 550000, 600000, 10000000]  # note the outlier
}
df = pd.DataFrame(data)
Step 3: Visualize the Data

plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['house_price'], color='blue')
plt.xlabel('Index')
plt.ylabel('House Price')
plt.title('House Prices Scatter Plot')
plt.show()
Step 4: Apply Isolation Forest

# Initialize the Isolation Forest model
iso_forest = IsolationForest(contamination=0.1)  # here, we assume 10% of the data could be outliers

# Fit the model to the data
iso_forest.fit(df[['house_price']])

# Predict the outliers
df['outlier'] = iso_forest.predict(df[['house_price']])
# The model assigns -1 to outliers and 1 to inliers
Step 5: Visualize the Outliers

plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['house_price'], color='blue', label='Inliers')
plt.scatter(df[df['outlier'] == -1].index, df[df['outlier'] == -1]['house_price'], color='red',
label='Outliers')
plt.xlabel('Index')
plt.ylabel('House Price')
plt.title('Outlier Detection in House Prices using Isolation Forest')
plt.legend()
plt.show()

Step 6: Examine the Results


print(df)
2. Semi-supervised methods:
• These methods use a combination of labeled and unlabeled data to identify outliers.
• For example, a semi-supervised outlier detection algorithm may use clustering to group similar data points together and then use the labeled data to identify outliers within the clusters.
Outliers within the Clusters
Example: Let's use a simple dataset with labeled and unlabeled points.
• We'll apply a clustering algorithm to group similar data points and then use labeled data to detect outliers within these clusters.
Steps:
• Create the Dataset: We'll create a dataset with labeled (normal and outlier) and unlabeled points.
• Cluster the Data: Use a clustering algorithm (e.g., KMeans) to group similar data points.
• Identify Outliers within Clusters: Use the labeled data to identify which points within each cluster are outliers.
Steps:
1.Dataset:
• Contains labeled points (normal and outlier) and unlabeled points.
2.Clustering:
• Use KMeans to cluster the data into two groups.
3.Calculate Distances:
• Compute the distances of labeled points to their respective cluster centers.
4.Set Threshold:
• Determine a threshold for identifying outliers (e.g., mean + 2*std of distances).
5.Identify Outliers:
• Flag points with distances greater than the threshold as outliers.
6.Plot the Results:
• Visualize the clusters, labeled data, and detected outliers.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Sample dataset with labeled (1: normal, -1: outlier) and unlabeled (0: unknown) points

data = [
{"Size": 1500, "Price": 300, "Label": 1},
{"Size": 1600, "Price": 320, "Label": 1},
{"Size": 1700, "Price": 340, "Label": 1},
{"Size": 1800, "Price": 360, "Label": 1},
{"Size": 1900, "Price": 380, "Label": 1},
{"Size": 2000, "Price": 400, "Label": 1},
{"Size": 2500, "Price": 1000, "Label": -1}, # Outlier
{"Size": 2600, "Price": 1050, "Label": -1}, # Outlier
{"Size": 1700, "Price": 30, "Label": -1}, # Outlier
{"Size": 1750, "Price": 350, "Label": 0}, # Unlabeled
{"Size": 1650, "Price": 310, "Label": 0}, # Unlabeled
]
# Convert the data into numpy arrays
X = np.array([[d["Size"], d["Price"]] for d in data])
labels = np.array([d["Label"] for d in data])

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
clusters = kmeans.predict(X)

# Separate labeled data
labeled_data = X[labels != 0]
labeled_labels = labels[labels != 0]

# Calculate distances of labeled points to their assigned cluster centers
distances = np.linalg.norm(labeled_data - kmeans.cluster_centers_[clusters[labels != 0]], axis=1)

# Set a threshold for detecting outliers (e.g., mean + 2*std of distances)
threshold = np.mean(distances) + 2 * np.std(distances)
outliers = labeled_data[distances > threshold]

# Print out the detected outliers
print("Detected outliers within clusters:")
for outlier in outliers:
    print(outlier)
# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=clusters, cmap='viridis', marker='o', label='Unlabeled Data')
plt.scatter(labeled_data[:, 0], labeled_data[:, 1], c=labeled_labels, cmap='coolwarm', marker='x',
label='Labeled Data')
plt.scatter(outliers[:, 0], outliers[:, 1], color='red', marker='s', s=100, label='Detected Outliers')
plt.xlabel('Size (sq ft)')
plt.ylabel('Price (in thousands)')
plt.legend()
plt.title('Semi-Supervised Outlier Detection using Clustering')
plt.show()
Output:
Detected outliers within clusters:
3. Unsupervised methods:
• These methods use only unlabeled data to identify outliers.
• For example, unsupervised outlier detection methods can use density-based or distance-based methods to identify data points that are far away from the rest of the data.
• Some popular unsupervised methods include the Local Outlier Factor (LOF), k-nearest-neighbor (k-NN) based methods, and DBSCAN.
1. Local Outlier Factor (LOF)
• LOF measures the local density deviation of a data point compared to its
neighbors.
• Points with significantly lower density than their neighbors are
considered outliers.
2. k-Nearest Neighbors (k-NN)
• The k-NN method calculates the distance of a data point to its k-nearest
neighbors.
• Points far from their neighbors are flagged as outliers.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN groups data points into clusters based on density.
• Points that do not fit into any cluster are considered outliers (noise).
4. One-Class SVM
• One-Class Support Vector Machine (SVM) tries to separate the normal
data points from the outliers by finding a hyperplane that best encloses
the majority of the data points.
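As an illustrative sketch of the first of these methods, scikit-learn's LocalOutlierFactor can be applied to a small toy dataset (the values below are assumed for illustration):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy 2-D dataset with one obvious outlier
X = np.array([[1500, 300], [1600, 320], [1700, 340],
              [1800, 360], [1900, 380], [2600, 1050]])

# LOF compares each point's local density to that of its neighbors;
# fit_predict returns -1 for outliers and 1 for inliers
lof = LocalOutlierFactor(n_neighbors=3)
labels = lof.fit_predict(X)
print("Labels:", labels)
print("Detected outliers:", X[labels == -1])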
• In addition to these three main categories, there are also other methods for outlier detection, such as
• ensemble methods that combine multiple methods, or
• deep learning-based methods that use neural networks to identify outliers.
• Note that the choice of outlier detection method depends on the specific characteristics of the data and the problem at hand.
• It's also important to consider the trade-off between computational cost and the accuracy of outlier detection.
Sorting the Data for Outlier Detection
• Sorting data is a fundamental step in data analysis that helps organize
the dataset, making it easier to identify outliers and discern patterns.

1. Sorting in Ascending/Descending Order

2. Grouping by Categories
Sorting in Ascending/Descending Order
Numerical Order:
1. Ascending Order: Sorting numerical data from the smallest to the largest value.
2. Descending Order: Sorting numerical data from the largest to the smallest value.
How It Helps:
• Highlighting Outliers: By sorting data, outliers become more apparent. For
example, if most values are clustered within a specific range, but a few values
are significantly higher or lower, these outliers will stand out at the ends of
the sorted list.
• Easier Visualization: Sorted data makes it easier to create visualizations, such
as box plots and scatter plots, where outliers and patterns can be more easily
identified.
Grouping by Categories
Categorical Data:
• Sorting categorical data involves organizing the data into groups based on categories.
How It Helps:
• Analyzing Each Group Separately: By sorting and grouping data by categories, you can analyze each group independently. This makes it easier to identify anomalies or patterns within each group.
• Comparing Groups: Sorting data by categories allows for comparison between different groups, which can help identify any discrepancies or outliers in specific categories.
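A minimal pandas sketch of both ideas (assuming a DataFrame df with a numeric column 'column_name' and a categorical column 'category_column'):

# Sort in ascending order; extreme values surface at either end
sorted_df = df.sort_values('column_name')
print(sorted_df.head())   # smallest values
print(sorted_df.tail())   # largest values

# Group by category and summarize each group separately
print(df.groupby('category_column')['column_name'].describe())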
Drawing Graphs/ Plots
Visualizing data with graphs and plots is an effective way to detect and understand outliers:
1. Box Plot
2. Scatter Plot
3. Histogram
Drawing Graphs/ Plots

1.Box Plot: A box plot shows the distribution of data and highlights
outliers as points outside the whiskers.
2. Scatter Plot:
• A scatter plot displays individual data points, helping identify
outliers in two-dimensional data.
3. Histogram: A histogram shows the frequency distribution of
data, with outliers appearing as bars separated from the main
cluster.
How to Handle Outliers
• Trimming refers to the process of removing a specified percentage of the highest and
lowest values in a dataset.

• Trimming Process
• Define the Trimming Percentage:
• Decide how much data you want to trim from the top and bottom ends of the
distribution.
• Common choices are 1%, 5%, or 10%, depending on the dataset and the context.
• Sort the Data:
• Sort the dataset in ascending order to easily identify the highest and lowest values.
• Determine the Cutoff Points:
• Calculate the indices for the values to be removed based on the trimming percentage.
• For example, if you decide to trim 5% from both ends of a dataset of 1000
observations, you would remove the lowest 50 and highest 50 values.
• Remove the Outliers:
• Exclude the determined highest and lowest values from your dataset.
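A minimal trimming sketch (assuming a numeric column 'column_name' in a DataFrame df and a 5% trim from each end; the quantile cutoffs are equivalent to sorting the data and removing the extreme 5% of rows at each end):

lower = df['column_name'].quantile(0.05)
upper = df['column_name'].quantile(0.95)

# Keep only the rows inside the 5th-95th percentile range
trimmed_df = df[(df['column_name'] >= lower) & (df['column_name'] <= upper)]
print(len(df), "rows before trimming,", len(trimmed_df), "rows after")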
Considerations of Trimming:
• Trimming can improve model performance by reducing the influence of extreme values, but it may also lead to loss of valuable information.
• Ensure that the trimmed values are genuinely outliers rather than important data points.
• Trimming is a more aggressive approach compared to capping or winsorizing, as it completely removes data rather than adjusting its value.
• It's helpful to visualize the data before and after trimming (using histograms or box plots) to understand the impact of this method.
• Capping, also known as Winsorizing, is a method for handling outliers by replacing
extreme values with the nearest value within a specified range.
• This approach helps mitigate the impact of outliers on statistical analysis without
completely removing data points.
Capping Process
• Determine Capping Percentiles:
• Decide on the percentile thresholds for capping.
• Common choices are the 1st and 99th percentiles, or the 5th and 95th percentiles,
depending on how aggressive you want the capping to be.
• Calculate the Cutoff Values:
• Compute the values at the specified percentiles. For example, if using the 5th and
95th percentiles, you would find these values in your dataset.
• Replace Outliers:
• For values below the lower percentile (e.g., 5th), replace them with the value at the
lower percentile.
• For values above the upper percentile (e.g., 95th), replace them with the value at the
upper percentile.
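A minimal capping sketch (assuming a numeric column 'column_name' in a DataFrame df and the 5th/95th percentiles as thresholds):

lower = df['column_name'].quantile(0.05)
upper = df['column_name'].quantile(0.95)

# clip() replaces values below 'lower' with lower and values above 'upper' with upper
df['column_name_capped'] = df['column_name'].clip(lower=lower, upper=upper)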
Considerations
• Capping allows you to retain all data points, which can be beneficial for analyses that require a full dataset.
• While capping reduces the influence of extreme values, it may introduce bias if the capped values represent valid extreme cases.
• It's useful to visualize the data before and after capping (using box plots or histograms) to see the effect on the distribution.
Discretization
• It is the process of turning continuous data, like
heights or temperatures, into distinct categories or
groups.
• For example, instead of measuring height as a
precise number (like 5.4 feet), you might group it
into categories like "short," "average," and "tall."
• This helps simplify the data, making it easier to
analyze and understand.
Methods of Discretization
1. Equal Width Binning:
• The range of the continuous variable is divided into a specified number
of equal-width intervals.
• Example: If the range is from 0 to 100 and you want 5 bins, each bin covers a width of 20 (0-20, 20-40, ..., 80-100).

2. Equal Frequency Binning (Quantile Binning):
• The data is divided into bins such that each bin contains approximately the same number of observations.
• Example: If you have 100 data points and want 4 bins, each bin will contain approximately 25 data points.
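Both methods are available directly in pandas; a minimal sketch (assuming a numeric column 'column_name' in a DataFrame df):

import pandas as pd

# Equal-width binning: 5 intervals of equal width across the range
df['equal_width_bin'] = pd.cut(df['column_name'], bins=5)

# Equal-frequency (quantile) binning: 4 bins with roughly equal counts
df['equal_freq_bin'] = pd.qcut(df['column_name'], q=4)

print(df[['column_name', 'equal_width_bin', 'equal_freq_bin']].head())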
3. K-means Clustering:
• Use clustering algorithms like K-means to group data points into clusters, then assign each point to a cluster (bin).
• This method can adapt to the distribution of the data rather than relying on fixed intervals.
• Let's say we have a dataset of house prices and we want to discretize the prices into bins.
Apply K-means Clustering:
• Choose the number of clusters, k=3.
• Apply K-means clustering to partition the prices into 3 clusters.
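A minimal sketch using scikit-learn's KBinsDiscretizer with strategy='kmeans' (the house prices below are assumed for illustration):

import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

prices = np.array([[100], [120], [130], [300], [320], [340], [900], [950]])

# strategy='kmeans' places bin edges between 1-D k-means cluster centers
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='kmeans')
bins = discretizer.fit_transform(prices)

print(bins.ravel())               # bin index assigned to each price
print(discretizer.bin_edges_[0])  # learned bin boundaries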
4. Decision Tree-Based Binning:
•Decision trees can be used to determine optimal cut points for discretization
based on the target variable.
•This method creates bins that maximize the separation of different classes or
values.
•Let's consider a dataset for predicting house prices with a continuous
feature, "square footage," and the target variable, "price range" (e.g., low,
medium, high).
•Determine optimal cut points :
• The decision tree might determine that houses with square footage
• below 1000 tend to fall into the "low" price range,
• between 1000 and 2000 into the "medium" price range, and
• above 2000 into the "high" price range.
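A minimal sketch of this idea with a shallow DecisionTreeClassifier (the square-footage values and price-range labels below are assumed for illustration; 0 = low, 1 = medium, 2 = high):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

sqft = np.array([[600], [800], [950], [1200], [1500], [1800], [2200], [2600]])
price_range = np.array([0, 0, 0, 1, 1, 1, 2, 2])

# A shallow tree learns the cut points that best separate the classes
tree = DecisionTreeClassifier(max_depth=2).fit(sqft, price_range)

# Internal nodes store the learned thresholds; leaf nodes are marked -2
thresholds = tree.tree_.threshold[tree.tree_.threshold != -2]
print("Learned cut points:", sorted(thresholds))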
5.Custom Binning:
•Define your own bins based on domain knowledge or specific requirements
of the analysis.
Considerations:
• The method of discretization can significantly impact the performance of machine learning models. Choose a method that aligns with the nature of the data and the analysis goals.
• The number of bins should be carefully selected; too few bins can oversimplify the data, while too many bins can lead to overfitting.
• Discretization may lead to a loss of information, so it's essential to evaluate the trade-offs between simplicity and accuracy.
