Outliers are data points that differ significantly from the rest of the data. These values are often far removed from the general pattern of the dataset, disrupting its overall distribution. Outlier detection is an important statistical technique used to identify these unusual values, which could result from various factors like measurement errors, incorrect data entry or genuinely rare events.
Impact of Outliers on Models
Outliers can have several detrimental effects on the performance and accuracy of machine learning models:
- Skewed Data Distribution: Outliers can distort the shape of the data, making it unrepresentative of the underlying trend.
- Distorted Statistical Metrics: They can alter essential statistics, such as the mean, variance and standard deviation, leading to inaccurate conclusions.
- Biased Model Accuracy: Outliers can bias the model, reducing its ability to generalize to new data and impacting overall prediction accuracy.
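As a small illustration of the second point, the sketch below compares the mean and standard deviation with the median and MAD before and after a single extreme value is appended. The numbers are made up for illustration and do not come from any real dataset.

```r
# Hypothetical values chosen only for illustration
clean <- c(10, 12, 11, 13, 12, 11, 10, 12)
with_outlier <- c(clean, 100)   # one extreme value appended

# Mean and standard deviation shift strongly once the outlier is present
mean(clean); mean(with_outlier)
sd(clean); sd(with_outlier)

# Median and MAD (median absolute deviation) barely move
median(clean); median(with_outlier)
mad(clean); mad(with_outlier)
```

Rank-based summaries such as the median are often preferred for a first look at data that may contain outliers, for exactly this reason.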
Implementation of Outlier Detection
We will explore different methods to detect and remove outliers present in a given dataset.
1. Create Data with Outliers
We will create sample data with the rnorm() function, generating 500 normally distributed data points, and then overwrite the first 10 values with hand-picked outliers.
R
data <- rnorm(500)
data[1:10] <- c(46,9,15,-90,
42,50,-82,74,61,-32)
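Because rnorm() draws new values on every run, the exact sample differs between sessions. A minimal sketch, assuming a fixed seed is acceptable (the seed value 42 and the helper name injected are arbitrary choices, not part of the original example), makes the data reproducible and confirms that the injected values are in place:

```r
set.seed(42)                       # arbitrary seed, an assumption for reproducibility
data <- rnorm(500)
injected <- c(46, 9, 15, -90, 42, 50, -82, 74, 61, -32)
data[1:10] <- injected             # overwrite the first ten draws

length(data)                       # still 500 points
range(data)                        # the extremes now come from the injected values
summary(data)
```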
2. Visualizing Outliers Using Boxplot
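We use the boxplot() function to visualize outliers. By default, the whiskers extend to the most extreme data point that lies within 1.5 times the interquartile range of the box, and any point beyond the whiskers is drawn individually as an outlier.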
Syntax:
boxplot(x, data, notch, varwidth, names, main)
Parameters:
- x: A numeric vector or a formula (for example, y ~ group) describing the data to plot.
- data: The data frame from which the variables in a formula are taken.
- notch: A logical value. Set to TRUE to draw a notch on each side of the box around the median.
- varwidth: A logical value. Set to TRUE to draw box widths proportional to the square root of the sample size of each group.
- main: The title of the chart.
- names: The group labels shown under each boxplot.
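The parameters above can be combined in a single call. The sketch below is only an illustration: the data frame df, its columns value and group, and the labels are hypothetical and not part of the original example.

```r
# Hypothetical data frame: two groups of different sizes
set.seed(1)
df <- data.frame(
  value = c(rnorm(30), rnorm(60, mean = 1)),
  group = rep(c("a", "b"), times = c(30, 60))
)

# boxplot() invisibly returns the statistics it used for drawing
res <- boxplot(value ~ group, data = df,
               notch = TRUE,                      # notch around each median
               varwidth = TRUE,                   # wider box for the larger group
               names = c("Group A", "Group B"),   # labels under the boxes
               main = "Value by group")

res$n     # number of observations per group
res$out   # values drawn as outliers, if any
```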
R
data <- rnorm(500)
data[1:10] <- c(46,9,15,-90,
42,50,-82,74,61,-32)
boxplot(data)
Output:
[Figure: Outlier Detection]
3. Removing Outliers
We will remove the outliers using the boxplot.stats() function, whose $out component holds the values that lie beyond the whiskers. The expression !data %in% boxplot.stats(data)$out keeps only the values that are not in that set.
R
newdata <- data[!data %in% boxplot.stats(data)$out]
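To see what boxplot.stats() returns, the sketch below runs it on a small made-up vector (the values in x are hypothetical). Note that boxplot.stats() works with hinges, which can differ slightly from the quartiles returned by quantile().

```r
# Small hypothetical vector with one obvious extreme value
x <- c(2, 3, 3, 4, 5, 5, 6, 40)

bs <- boxplot.stats(x)   # default coef = 1.5
bs$stats                 # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out                   # values beyond the whiskers: 40

# Keep only the values that are not flagged
x_clean <- x[!x %in% bs$out]
x_clean                  # 2 3 3 4 5 5 6
```

Because %in% matches by value, every copy of a flagged value is dropped, which is the intended behavior here.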
4. Verifying Outlier Removal
We verify that the outliers have been removed by plotting the boxplot again, this time for newdata.
R
boxplot(newdata)
Output:
[Figure: Outlier Detection]
As the output plot shows, the extreme values are no longer drawn, so the outliers have been removed. Because the whiskers are recomputed from the trimmed data, a few borderline points can occasionally still be flagged.
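The visual check can be complemented by a numeric one. The sketch below is a minimal example under the assumption of a fixed, arbitrary seed (42); it counts how many values were removed and how many the recomputed rule flags afterwards. The optional loop and the name trimmed are illustrative additions rather than part of the original workflow.

```r
set.seed(42)                                   # arbitrary seed, an assumption
data <- rnorm(500)
data[1:10] <- c(46, 9, 15, -90, 42, 50, -82, 74, 61, -32)

newdata <- data[!data %in% boxplot.stats(data)$out]

# How many values were removed, and how many are flagged after recomputing
length(data) - length(newdata)
length(boxplot.stats(newdata)$out)

# Optional: repeat the trimming until nothing is flagged any more
trimmed <- newdata
while (length(boxplot.stats(trimmed)$out) > 0) {
  trimmed <- trimmed[!trimmed %in% boxplot.stats(trimmed)$out]
}
length(boxplot.stats(trimmed)$out)             # 0 once the loop has finished
```

Repeated trimming should be used with care, since each pass narrows the whiskers and may discard legitimate observations.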
5. Visualizing Outliers with a Histogram
Histograms are another way to detect outliers visually: outliers appear as isolated bars far from the main body of the distribution. Here, we create a dataset with a few extreme values appended and plot a histogram.
R
set.seed(123)
data <- c(rnorm(1000), 10, 15, 12, 100)
hist(data)
Output:
[Figure: Outlier Detection]
6. Detecting and Removing Outliers from Multiple Columns
To detect and remove outliers from a data frame, we use the interquartile range (IQR) method. An observation is considered an outlier if it lies more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile.
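As a worked example of this rule, the sketch below computes the two fences for a single vector. The values match column x of the sample data frame used in this section; the variable names q1, q3, iqr, lower and upper are chosen here for illustration.

```r
x <- c(1, 2, 3, 4, 3, 12, 3, 4, 4, 15, 0)

q1 <- quantile(x, 0.25, names = FALSE)   # first quartile: 2.5
q3 <- quantile(x, 0.75, names = FALSE)   # third quartile: 4
iqr <- IQR(x)                            # q3 - q1 = 1.5

lower <- q1 - 1.5 * iqr                  # 0.25
upper <- q3 + 1.5 * iqr                  # 6.25

x[x < lower | x > upper]                 # flagged values: 12 15 0
```

Note that 0 is flagged as well, because it falls just below the lower fence of 0.25.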
We create functions to detect and remove outliers using the IQR method.
- detect_outlier() function calculates the IQR and identifies values outside the acceptable range.
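- remove_outlier() function iterates through the columns of the data frame and removes the rows that contain outliers based on the IQR method. Columns are processed one after another, so the quartiles for each later column are computed on the rows that remain after the earlier columns have been filtered.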
R
# Sample data frame with a few extreme values in each column
sample_data <- data.frame(x = c(1, 2, 3, 4, 3, 12, 3, 4, 4, 15, 0),
                          y = c(4, 3, 25, 7, 8, 5, 9, 77, 6, 5, 0),
                          z = c(1, 3, 2, 90, 8, 7, 0, 48, 7, 2, 3))
print("Display original dataframe")
print(sample_data)
boxplot(sample_data)

# Return TRUE for every value outside the 1.5 * IQR fences
detect_outlier <- function(x) {
  Quantile1 <- quantile(x, probs = .25)
  Quantile3 <- quantile(x, probs = .75)
  iqr <- Quantile3 - Quantile1   # named iqr to avoid shadowing stats::IQR()
  x > Quantile3 + (iqr * 1.5) | x < Quantile1 - (iqr * 1.5)
}

# Drop the rows flagged in each listed column, one column at a time
remove_outlier <- function(dataframe, columns = names(dataframe)) {
  for (col in columns) {
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
  return(dataframe)
}

remove_outlier(sample_data, c('x', 'y', 'z'))
Output:
[Figure: Sample Data containing outliers]
[Figure: Outliers Detection]
[Figure: Output after outlier removal]
In this article, we learned how to detect and remove outliers in R using visualizations and statistical methods such as the IQR method, producing cleaner data for better analysis and model accuracy.