Concepts of EDA, Outliers-Detection and Treatment
MLDS
Contents:
• Exploratory Data Analysis (EDA):
• The process of EDA,
• Knowing Initial Details about Data,
• Modifying or Removing Unwanted Data,
• Retrieving Data,
• Getting Statistical Data,
• Drawing Graphs/ Plots.
• Outliers:
• Causes of Outliers,
• Detecting the Outliers,
• Sorting the Data,
• Drawing Graphs/ Plots,
• Inter Quartile Range (IQR) Method,
• How to Handle Outliers
Exploratory Data Analysis (EDA)
• It is a crucial step in the data analysis process.
• It involves summarizing the main characteristics of the data, often
with visual methods.
• EDA is an iterative process that involves:
1. Understanding the initial structure and details of the data.
2. Cleaning and preprocessing the data.
3. Extracting and focusing on relevant subsets of data.
4. Computing statistical summaries.
5. Visualizing the data for better insights.
1. Knowing Initial Details about Data:
• Understand the basic structure and attributes of the dataset.
Steps:
• Load the dataset.
• View the first few rows.
• Check the data types and non-null values.
• Get a concise summary of the dataset.
import pandas as pd
# Load dataset
df = pd.read_csv('your_dataset.csv')
# View the first few rows
print(df.head())
# Check data types and non-null values (info() prints its report directly)
df.info()
# Get a concise summary
print(df.describe())
2. Modifying or Removing Unwanted Data:
• Clean the data by handling missing values, outliers, and irrelevant features.
Steps:
• Identify and handle missing values.
• Remove duplicate rows.
• Drop irrelevant columns.
# Handling missing values
df = df.dropna()  # Drop rows with missing values
# (or)
df = df.ffill()  # Forward fill to handle missing values
# Remove duplicate rows
df = df.drop_duplicates()
# Drop irrelevant columns
df = df.drop(['column_name1', 'column_name2'], axis=1)
3. Retrieving Data:
• Extract specific subsets of data for focused analysis.
Steps:
•Filter rows based on conditions.
•Select specific columns.
•Group data by categorical variables.
# Filter rows based on conditions
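The three retrieval steps can be sketched as follows. This is a minimal illustration, not the original slide code: the DataFrame and its column names (city, size, price) are hypothetical stand-ins for your own dataset.

```python
import pandas as pd

# Hypothetical data standing in for the dataset loaded earlier
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "size": [1500, 1600, 1700, 2500, 1900],
    "price": [300, 320, 340, 1000, 380],
})

# Filter rows based on a condition
expensive = df[df["price"] > 350]

# Select specific columns
subset = df[["size", "price"]]

# Group data by a categorical variable and aggregate
mean_price_by_city = df.groupby("city")["price"].mean()

print(expensive)
print(subset.head())
print(mean_price_by_city)
```

Boolean indexing, column selection, and groupby all return new objects, so the original df is left unchanged.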
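5. Drawing Graphs/ Plots:
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
df['column_name'].hist()
plt.show()
# Line plot (time series)
df['column_name'].plot()
plt.show()
# Box plot
df.boxplot(column='column_name')
plt.show()
# Heatmap for correlation matrix
# (assumed definition of correlation_matrix; the original does not show it)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True)
plt.show()
# Scatter plot
plt.scatter(df['column1'], df['column2'])
plt.xlabel('column1')
plt.ylabel('column2')
plt.show()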
Outliers
• An outlier is a data point that differs significantly from the other data points in a dataset.
• Outliers can significantly affect statistical analysis, machine learning models, and the interpretation of the data, so it is essential to detect them.
Causes of Outliers
1. Measurement Errors: Mistakes during data collection or recording
can introduce outliers.
Example:
• Suppose a researcher is measuring the height of students in a school.
• If the measuring tape is not used correctly, say it’s not held straight, it
could result in a height reading that is significantly higher or lower than
the actual height.
• This incorrect measurement can introduce outliers in the dataset.
Causes of Outliers
2. Data Entry Errors: Typographical errors or incorrect data entries can
create outliers.
Example:
• When entering data manually into a computer system, a typographical
error might occur.
• For instance, if a person's age is recorded as 250 instead of 25 due to a
typing error, this will appear as an outlier in the age data of the
population.
Causes of Outliers
Example:
• In biological data, there might be a natural occurrence of outliers due to
genetic mutations or other factors.
• For instance, in a dataset measuring human heights, someone with
gigantism (a condition causing excessive growth) might be an outlier
due to their unusually tall stature compared to the average population.
Causes of Outliers
5. Sampling Errors: When a sample does not represent the population
well, outliers can emerge.
Example:
• If a survey on household incomes is conducted in a wealthy
neighborhood only, the results will not represent the general
population accurately.
• This skewed sampling can result in outliers when the data is compared
with a more representative sample of the population.
Detecting Outliers
• Detecting outliers is crucial because they can distort the overall
picture of the data and lead to incorrect conclusions if not
appropriately handled.
• Outliers can also affect the performance of many machine learning
models, as they can skew the results and lead to overfitting or poor
generalization.
• Thus, detecting outliers is essential for cleaning and preparing the
data for analysis and ensuring the results’ validity.
Detecting Outliers-Visual Inspection
• Plotting the data, such as using scatter plots or histograms, can help you
visually identify outliers that stand out from the rest of the data.
• Box Plots: A box plot displays data distribution and highlights outliers outside the "whiskers."
• Scatter Plots: These plots show individual data points and help identify
outliers in the context of two variables.
Box Plots
• A box plot (also known as a box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary.
• Five-Number Summary:
• Minimum: The smallest data point excluding outliers.
• First Quartile (Q1): The median of the lower half of the dataset.
• Median: The middle value of the dataset.
• Third Quartile (Q3): The median of the upper half of the dataset.
• Maximum: The largest data point excluding outliers.
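The five-number summary can be computed directly. The sketch below assumes the common 1.5 × IQR rule for deciding which points lie beyond the whiskers (the slide does not state the rule), and uses a hypothetical sample. Note that np.percentile interpolates linearly, so its quartiles can differ slightly from the "median of each half" definition above.

```python
import numpy as np

# Hypothetical sample with one extreme value
values = np.array([5, 6, 6, 7, 8, 9, 10, 100])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # inter-quartile range

# Assumed whisker rule: points beyond 1.5 * IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (values < lower_fence) | (values > upper_fence)

outliers = values[is_outlier]
inliers = values[~is_outlier]

summary = {
    "minimum": inliers.min(),   # smallest point excluding outliers
    "q1": q1,
    "median": median,
    "q3": q3,
    "maximum": inliers.max(),   # largest point excluding outliers
}
print(summary)
print("Outliers:", outliers)
```

The same fences are what the Inter Quartile Range (IQR) method listed in the contents uses to flag outliers without drawing a plot.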
• Box Plot Structure:
• For example, a supervised outlier detection algorithm may use a decision tree
or a random forest to classify data points as outliers or non-outliers based on
the features of the data.
• Decision Tree:
• A tree-like model that splits the data into branches to make
decisions based on labeled training data.
• A decision tree is a series of yes/no questions that help us
sort and group data.
• Each question splits the data into smaller and smaller
groups based on the answers.
• Classifies data points as outliers or non-outliers based on
learned rules from the training set.
• Imagine you have a small set of numbers, which represent the ages of
people in a group:
ages = [5, 6, 6, 7, 8, 9, 10, 100]
Dataset:
• A list of dictionaries containing Size and Price values, with some obvious outliers.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
data = [
    {"Size": 1500, "Price": 300},
    {"Size": 1600, "Price": 320},
    {"Size": 1700, "Price": 340},
    {"Size": 1800, "Price": 360},
    {"Size": 1900, "Price": 380},
    {"Size": 2000, "Price": 400},
    {"Size": 2500, "Price": 1000},  # Outlier
    {"Size": 2600, "Price": 1050},  # Outlier
    {"Size": 1700, "Price": 30},    # Outlier
]
• Convert the dataset into NumPy arrays for easier manipulation and feeding into the model.
# Convert the data into numpy arrays
X = np.array([[d["Size"]] for d in data])
y = np.array([d["Price"] for d in data])
• Use the DecisionTreeRegressor to fit the model to the data.
# Create and fit the decision tree model.
# max_depth is limited here (an assumption, not in the original slide): a fully
# grown tree memorizes the training data, which leaves almost every residual at zero.
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)
• Use the trained model to predict prices based on Size.
# Predict prices using the trained model
predictions = model.predict(X)
• Compute the absolute differences between actual prices and predicted prices.
residuals = np.abs(predictions - y)
threshold = 100
outliers = np.where(residuals > threshold)[0]
• Print the indices and details of detected outliers.
# Print out the detected outliers
print("Detected outliers at indices:", outliers)
for index in outliers:
    print(data[index])
plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['house_price'], color='blue')
plt.xlabel('Index')
plt.ylabel('House Price')
plt.title('House Prices Scatter Plot')
plt.show()
Step 4: Apply Isolation Forest
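The code for this step is not shown in the text, so the following is only a sketch of how the 'outlier' column used in the plot could be produced. The house_price values, the contamination setting, and the random_state are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical house prices (in thousands) with one extreme value
df = pd.DataFrame({
    "house_price": [250, 260, 255, 270, 265, 258, 262, 268, 252, 1500]
})

# contamination is the assumed share of outliers in the data
model = IsolationForest(contamination=0.1, random_state=42)

# fit_predict returns 1 for inliers and -1 for outliers
df["outlier"] = model.fit_predict(df[["house_price"]])

print(df)
```

The -1 / 1 labels match the convention used by the plotting code, which selects rows where df['outlier'] == -1.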
plt.figure(figsize=(10, 6))
plt.scatter(df.index, df['house_price'], color='blue', label='Inliers')
plt.scatter(df[df['outlier'] == -1].index, df[df['outlier'] == -1]['house_price'], color='red',
label='Outliers')
plt.xlabel('Index')
plt.ylabel('House Price')
plt.title('Outlier Detection in House Prices using Isolation Forest')
plt.legend()
plt.show()
Steps:
• Create the Dataset: We'll create a dataset with labeled (normal and outlier) and unlabeled
points.
• Cluster the Data: Use a clustering algorithm (e.g., KMeans) to group similar data points.
• Identify Outliers within Clusters: Use the labeled data to identify which points within each
cluster are outliers.
Steps:
1. Dataset:
• Contains labeled points (normal and outlier) and unlabeled points.
2. Clustering:
• Use KMeans to cluster the data into two groups.
3. Calculate Distances:
• Compute the distances of labeled points to their respective cluster centers.
4. Set Threshold:
• Determine a threshold for identifying outliers (e.g., mean + 2*std of distances).
5. Identify Outliers:
• Flag points with distances greater than the threshold as outliers.
6. Plot the Results:
• Visualize the clusters, labeled data, and detected outliers.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
# Sample dataset with labeled (1: normal, -1: outlier) and unlabeled (0: unknown) points
data = [
{"Size": 1500, "Price": 300, "Label": 1},
{"Size": 1600, "Price": 320, "Label": 1},
{"Size": 1700, "Price": 340, "Label": 1},
{"Size": 1800, "Price": 360, "Label": 1},
{"Size": 1900, "Price": 380, "Label": 1},
{"Size": 2000, "Price": 400, "Label": 1},
{"Size": 2500, "Price": 1000, "Label": -1}, # Outlier
{"Size": 2600, "Price": 1050, "Label": -1}, # Outlier
{"Size": 1700, "Price": 30, "Label": -1}, # Outlier
{"Size": 1750, "Price": 350, "Label": 0}, # Unlabeled
{"Size": 1650, "Price": 310, "Label": 0}, # Unlabeled
]
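# Convert the data into numpy arrays
# (the lines from here on are reconstructed from steps 2-5 above; the original code is truncated)
X = np.array([[d["Size"], d["Price"]] for d in data])
labels = np.array([d["Label"] for d in data])
# Cluster all points into two groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Keep only the labeled points
labeled_data = X[labels != 0]
labeled_labels = labels[labels != 0]
# Calculate distances of labeled points to their cluster centers
_, distances = pairwise_distances_argmin_min(labeled_data, kmeans.cluster_centers_)
# Set the threshold (mean + 2*std of distances) and flag points beyond it
threshold = distances.mean() + 2 * distances.std()
detected = labeled_data[distances > threshold]
print("Detected outliers:", detected)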
• Some popular unsupervised methods include the Local Outlier Factor (LOF),
k-nearest neighbor (KNN) based method, DBSCAN.
1. Local Outlier Factor (LOF)
• LOF measures the local density deviation of a data point compared to its
neighbors.
• Points with significantly lower density than their neighbors are
considered outliers.
2. k-Nearest Neighbors (k-NN)
• The k-NN method calculates the distance of a data point to its k-nearest
neighbors.
• Points far from their neighbors are flagged as outliers.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
• DBSCAN groups data points into clusters based on density.
• Points that do not fit into any cluster are considered outliers (noise).
4. One-Class SVM
• One-Class Support Vector Machine (SVM) tries to separate the normal
data points from the outliers by finding a hyperplane that best encloses
the majority of the data points.
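Two of the unsupervised methods above can be sketched with scikit-learn. The points and the parameter values (n_neighbors, eps, min_samples) are hypothetical and chosen only so that one point is clearly isolated; real data usually needs scaling and tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

# Five points in a tight group and one far away (hypothetical data)
X = np.array([
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.1],
    [0.9, 1.0], [1.0, 0.9], [5.0, 5.0],
])

# Local Outlier Factor: fit_predict returns 1 for inliers, -1 for outliers
lof = LocalOutlierFactor(n_neighbors=3)
lof_labels = lof.fit_predict(X)

# DBSCAN: points that belong to no cluster get the label -1 (noise)
dbscan = DBSCAN(eps=0.5, min_samples=3)
dbscan_labels = dbscan.fit_predict(X)

print("LOF labels:   ", lof_labels)
print("DBSCAN labels:", dbscan_labels)
```

Both methods flag the isolated point without using any labels, which is what makes them unsupervised.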
• In addition to these three main categories, there are also other methods for
outlier detection, such as
• ensemble methods that combine multiple methods or
• deep learning-based methods that use neural networks to identify outliers.
• Note that the choice of outlier detection method depends on the specific characteristics of the data and the problem at hand.
Sorting in Ascending/Descending Order
Numerical Order:
1. Ascending Order: Sorting numerical data from the smallest to the largest value.
2. Descending Order: Sorting numerical data from the largest to the smallest value.
How It Helps:
• Highlighting Outliers: By sorting data, outliers become more apparent. For
example, if most values are clustered within a specific range, but a few values
are significantly higher or lower, these outliers will stand out at the ends of
the sorted list.
• Easier Visualization: Sorted data makes it easier to create visualizations, such
as box plots and scatter plots, where outliers and patterns can be more easily
identified.
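A short sketch of sorting with pandas, using the ages from the earlier example in shuffled order. After sorting, the extreme value sits at one end of the list.

```python
import pandas as pd

# Ages from the earlier example, in arbitrary order
ages = pd.Series([8, 100, 6, 5, 10, 7, 6, 9])

ascending = ages.sort_values()                  # smallest to largest
descending = ages.sort_values(ascending=False)  # largest to smallest

print(ascending.tolist())
print(descending.tolist())
```

In the ascending list the value 100 appears last, well separated from the rest, which is the visual cue described above.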
Grouping by Categories
Categorical Data:
• Sorting categorical data involves organizing the data into groups based on categories.
How It Helps:
• Analyzing Each Group Separately: By sorting and grouping data by
categories, you can analyze each group independently. This makes it
easier to identify anomalies or patterns within each group.
• Comparing Groups: Sorting data by categories allows for comparison
between different groups, which can help identify any discrepancies or
outliers in specific categories.
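Grouping can be sketched with pandas as below. The department and salary columns are hypothetical; the idea is that a value which looks ordinary overall can stand out once each group is summarised on its own.

```python
import pandas as pd

# Hypothetical salaries (in thousands) for two departments
df = pd.DataFrame({
    "department": ["HR", "HR", "HR", "IT", "IT", "IT"],
    "salary": [40, 42, 41, 90, 95, 300],
})

# Sort by category, then by value, so each group's extremes sit at its ends
sorted_df = df.sort_values(["department", "salary"])

# Summarise each group separately for comparison
group_summary = df.groupby("department")["salary"].agg(["min", "median", "max"])

print(sorted_df)
print(group_summary)
```

Comparing each group's max with its median shows that 300 is far from the typical IT salary, while the HR values stay close together.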
Drawing Graphs/ Plots
1. Box Plot: A box plot shows the distribution of data and highlights outliers as points outside the whiskers.
2. Scatter Plot: A scatter plot displays individual data points, helping identify outliers in two-dimensional data.
3. Histogram: A histogram shows the frequency distribution of data, with outliers appearing as bars separated from the main cluster.
How to Handle Outliers
• Trimming refers to the process of removing a specified percentage of the highest and
lowest values in a dataset.
• Trimming Process
• Define the Trimming Percentage:
• Decide how much data you want to trim from the top and bottom ends of the
distribution.
• Common choices are 1%, 5%, or 10%, depending on the dataset and the context.
• Sort the Data:
• Sort the dataset in ascending order to easily identify the highest and lowest values.
• Determine the Cutoff Points:
• Calculate the indices for the values to be removed based on the trimming percentage.
• For example, if you decide to trim 5% from both ends of a dataset of 1000
observations, you would remove the lowest 50 and highest 50 values.
• Remove the Outliers:
• Exclude the determined highest and lowest values from your dataset.
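The trimming process can be sketched as a small function. The helper name trim and the sample data are hypothetical; the example reproduces the 5% case on 1000 observations described above.

```python
import numpy as np

def trim(values, fraction):
    """Drop the lowest and highest `fraction` of the sorted values."""
    ordered = np.sort(np.asarray(values))
    k = int(len(ordered) * fraction)  # number of points removed from each end
    return ordered[k:len(ordered) - k]

# 1000 hypothetical observations: 1, 2, ..., 1000
data = np.arange(1, 1001)
trimmed = trim(data, 0.05)

print(len(trimmed), trimmed[0], trimmed[-1])
```

With 1000 observations and a 5% trimming percentage, 50 values are removed from each end and 900 remain.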
Considerations of Trimming:
• Ensure that the trimmed values are genuinely outliers rather than important data points.
• It’s helpful to visualize the data before and after trimming (using histograms or box plots) to understand the impact of this method.
• Capping, also known as Winsorizing, is a method for handling outliers by replacing
extreme values with the nearest value within a specified range.
• This approach helps mitigate the impact of outliers on statistical analysis without
completely removing data points.
Capping Process
• Determine Capping Percentiles:
• Decide on the percentile thresholds for capping.
• Common choices are the 1st and 99th percentiles, or the 5th and 95th percentiles,
depending on how aggressive you want the capping to be.
• Calculate the Cutoff Values:
• Compute the values at the specified percentiles. For example, if using the 5th and
95th percentiles, you would find these values in your dataset.
• Replace Outliers:
• For values below the lower percentile (e.g., 5th), replace them with the value at the
lower percentile.
• For values above the upper percentile (e.g., 95th), replace them with the value at the
upper percentile.
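A minimal sketch of capping at the 5th and 95th percentiles with NumPy. The data are hypothetical, with one extreme value added on purpose.

```python
import numpy as np

# Hypothetical data: 1..99 plus one extreme value
values = np.arange(1, 101, dtype=float)
values[-1] = 1000.0

# Cutoff values at the chosen percentiles
lower, upper = np.percentile(values, [5, 95])

# Replace values below/above the cutoffs with the cutoffs themselves
capped = np.clip(values, lower, upper)

print("Cutoffs:", lower, upper)
print("Max before:", values.max(), "Max after:", capped.max())
```

Unlike trimming, the capped array has the same length as the original; only the extreme values change.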
Considerations of Capping:
• It’s useful to visualize the data before and after capping (using box plots or histograms) to see the effect on the distribution.
Discretization
• It is the process of turning continuous data, like
heights or temperatures, into distinct categories or
groups.
• For example, instead of measuring height as a
precise number (like 5.4 feet), you might group it
into categories like "short," "average," and "tall."
• This helps simplify the data, making it easier to
analyze and understand.
Methods of Discretization
1. Equal Width Binning:
• The range of the continuous variable is divided into a specified number
of equal-width intervals.
• Example: If the range is from 0 to 100 and you want 5 bins, each bin would cover a width of 20 (0-20, 20-40, 40-60, 60-80, 80-100), with each shared edge assigned to only one of the two bins.
• The number of bins should be carefully selected; too few bins can oversimplify the data, while too many bins can retain noise and lead to overfitting.
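Equal width binning can be sketched with pandas.cut. The scores and the category names are hypothetical; the edges follow the 0 to 100, 5-bin example above.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous values between 0 and 100
scores = pd.Series([0, 15, 20, 35, 50, 75, 100])

# Six edges give 5 equal-width bins: 0, 20, 40, 60, 80, 100
edges = np.linspace(0, 100, 6)

# Bins are right-inclusive by default, so 20 falls in the first bin;
# include_lowest=True keeps the value 0 in the first bin as well
bin_index = pd.cut(scores, bins=edges, include_lowest=True, labels=False)

# The same bins with descriptive (hypothetical) category names
names = ["very low", "low", "medium", "high", "very high"]
bin_label = pd.cut(scores, bins=edges, include_lowest=True, labels=names)

print(bin_index.tolist())
print(bin_label.tolist())
```

Passing an integer such as bins=5 instead of explicit edges lets pandas derive equal-width edges from the observed minimum and maximum.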