Outliers are data points that differ significantly from the rest of the data. These values are often far removed from the general pattern of the dataset, disrupting its overall distribution. Outlier detection is an important statistical technique used to identify these unusual values, which could result from various factors like measurement errors, incorrect data entry or genuinely rare events.
Impact of Outliers on Models
Outliers can have several detrimental effects on the performance and accuracy of machine learning models:
- Skewed Data Distribution: Outliers can distort the shape of the data, making it unrepresentative of the underlying trend.
- Distorted Statistical Metrics: They can alter essential statistics, such as the mean, variance and standard deviation, leading to inaccurate conclusions.
- Biased Model Accuracy: Outliers can bias the model, reducing its ability to generalize to new data and impacting overall prediction accuracy.
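The effect on summary statistics can be sketched with a small example. The values below are hypothetical and chosen only for illustration; the median is included as a contrast because it depends on rank rather than magnitude.

```r
# Hypothetical measurements, chosen only for illustration
clean <- c(10, 12, 11, 13, 12, 11, 10, 12)
with_outlier <- c(clean, 100)  # the same values plus one extreme point

# Collect the statistics mentioned above, plus the median for contrast
summary_stats <- function(v) {
  c(mean = mean(v), variance = var(v), sd = sd(v), median = median(v))
}

comparison <- rbind(
  clean = summary_stats(clean),
  with_outlier = summary_stats(with_outlier)
)
print(round(comparison, 2))
```

In this sketch a single extreme value pulls the mean and standard deviation far from their original values, while the median barely moves.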
Implementation of Outlier Detection
We will explore different methods to detect and remove outliers present in a given dataset.
1. Create Data with Outliers
We will create sample data by generating 500 data points with the rnorm() function. We then overwrite the first 10 values with extreme values that act as outliers.
R
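# 500 draws from a standard normal distribution
data <- rnorm(500)
# Overwrite the first 10 values with extreme values that act as outliers
data[1:10] <- c(46, 9, 15, -90, 42, 50, -82, 74, 61, -32)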
2. Visualizing Outliers Using Boxplot
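We use the boxplot() function to visualize outliers. By default, the whiskers extend to the most extreme points that lie within 1.5 times the interquartile range of the box, and points outside the "whiskers" are drawn individually as outliers.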
Syntax:
boxplot(x, data, notch, varwidth, names, main)
Parameters:
- x: A vector or a formula describing the values to plot.
- data: The data frame that contains the variables used in the formula.
- notch: A logical value. Set to TRUE to draw a notch on each side of the boxes around the median.
- varwidth: A logical value. Set to TRUE to draw box widths proportional to the square root of the number of observations in each group.
- names: The group labels shown under each boxplot.
- main: The title of the chart.
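The parameters above can be combined as in the following sketch. The data frame `scores`, its columns, the seed and the labels are all hypothetical and exist only to show each argument in use.

```r
# Hypothetical example data: three groups of different sizes
set.seed(1)  # assumed seed, only to make the sketch reproducible
scores <- data.frame(
  value = c(rnorm(30, mean = 10), rnorm(60, mean = 12), rnorm(120, mean = 11)),
  group = rep(c("a", "b", "c"), times = c(30, 60, 120))
)

bp <- boxplot(
  value ~ group,      # x: a formula splitting value by group
  data = scores,      # data: the data frame holding both variables
  notch = TRUE,       # draw notches around each median
  varwidth = TRUE,    # box widths scale with sqrt(group size)
  names = c("Group A", "Group B", "Group C"),  # labels under each box
  main = "Values by group"                     # chart title
)

bp$n    # number of observations in each group
bp$out  # values drawn as individual outlier points
```

boxplot() also returns its computed statistics invisibly, so the outlier values can be inspected without reading them off the plot.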
R
data <- rnorm(500)
data[1:10] <- c(46, 9, 15, -90, 42, 50, -82, 74, 61, -32)
boxplot(data)
Output:
[Figure: boxplot of the data with outliers]
3. Removing Outliers
We will remove the outliers using the boxplot.stats() function, whose $out component holds the values that lie beyond the whiskers. The !data %in% condition keeps only the values that are not in that set.
R
newdata <- data[!data %in% boxplot.stats(data)$out]
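To see what this filter actually drops, the step can be repeated with the flagged values kept in a separate variable. This is a minimal sketch; the seed and the names `injected` and `flagged` are assumptions added for illustration.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
data <- rnorm(500)
injected <- c(46, 9, 15, -90, 42, 50, -82, 74, 61, -32)
data[1:10] <- injected

# Values that boxplot.stats() places beyond the whiskers
flagged <- boxplot.stats(data)$out
newdata <- data[!data %in% flagged]

length(flagged)             # how many values were flagged
all(injected %in% flagged)  # TRUE if every injected value was caught
length(newdata)             # size of the cleaned vector
```

The flagged set can contain more than the 10 injected values, because a few ordinary draws from rnorm() may also fall beyond the whiskers.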
4. Verifying Outlier Removal
We verify that the outliers have been removed by plotting the boxplot again, this time for the cleaned data.
R
boxplot(newdata)
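Output:
[Figure: boxplot of the data after outlier removal]
As the output plot shows, the extreme values no longer appear, so the outliers have been identified and removed. Because the quartiles are recomputed on the cleaned data, a few borderline points may still be drawn beyond the new whiskers.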
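The same check can be done numerically instead of visually. The sketch below assumes a fixed seed for reproducibility and rebuilds the data so that it runs on its own.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible
data <- rnorm(500)
data[1:10] <- c(46, 9, 15, -90, 42, 50, -82, 74, 61, -32)
newdata <- data[!data %in% boxplot.stats(data)$out]

# The extreme injected values should no longer be in the range
range(newdata)

# Re-running the rule on the cleaned data uses new quartiles,
# so this count is not guaranteed to be zero
remaining <- boxplot.stats(newdata)$out
length(remaining)
```

If the count is non-zero, the remaining points are borderline values relative to the new, tighter whiskers rather than the injected extremes.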
5. Visualizing Outliers with a Histogram
Histograms are another way to detect outliers visually. Here, we create a dataset of 1000 normally distributed values, append four extreme values (10, 15, 12 and 100) and plot a histogram.
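A histogram can also be inspected as an object rather than only as a picture. The following sketch assumes the same kind of data and uses `plot = FALSE` with a larger `breaks` value (both choices are assumptions, not part of the original example) to list the bins that contain observations.

```r
set.seed(123)
data <- c(rnorm(1000), 10, 15, 12, 100)

# Compute the histogram without drawing it; breaks = 50 is a suggestion
# to hist(), which picks "pretty" bin edges close to that number
h <- hist(data, breaks = 50, plot = FALSE)

# Show only non-empty bins: isolated bins far from the bulk hint at outliers
non_empty <- h$counts > 0
data.frame(mid = h$mids[non_empty], count = h$counts[non_empty])
```

Bins with very small counts that are separated from the main cluster by empty bins correspond to the isolated bars seen in the plot.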
R
set.seed(123)
data <- c(rnorm(1000), 10, 15, 12, 100)
hist(data)
Output:
[Figure: histogram of the data with outliers]
6. Detecting and Removing Outliers from Multiple Columns
To detect and remove outliers from a data frame, we use the interquartile range (IQR) method. An observation is considered an outlier if it lies more than 1.5 times the IQR above the third quartile or more than 1.5 times the IQR below the first quartile, where the IQR is the third quartile minus the first quartile.
We create functions to detect and remove outliers using the IQR method.
- detect_outlier() function calculates the IQR and identifies values outside the acceptable range.
- remove_outlier() function iterates through the columns of the data frame and removes the rows that contain outliers based on the IQR method.
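Before wrapping the rule in functions, it can help to work through it once on a single numeric vector. This is a minimal sketch; the names `lower_fence`, `upper_fence` and `is_outlier` are introduced here for illustration only.

```r
x <- c(1, 2, 3, 4, 3, 12, 3, 4, 4, 15, 0)

q1 <- unname(quantile(x, 0.25))  # first quartile
q3 <- unname(quantile(x, 0.75))  # third quartile
iqr <- q3 - q1                   # same value as IQR(x)

# Values outside these fences are treated as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

is_outlier <- x < lower_fence | x > upper_fence
c(lower = lower_fence, upper = upper_fence)
x[is_outlier]
```

With R's default quantile method, the quartiles here are 2.5 and 4, which gives fences of 0.25 and 6.25, so the values 12, 15 and 0 are flagged.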
R
# Sample data frame with outliers in each column
sample_data <- data.frame(
  x = c(1, 2, 3, 4, 3, 12, 3, 4, 4, 15, 0),
  y = c(4, 3, 25, 7, 8, 5, 9, 77, 6, 5, 0),
  z = c(1, 3, 2, 90, 8, 7, 0, 48, 7, 2, 3)
)
print("Display original dataframe")
print(sample_data)
boxplot(sample_data)

# Return TRUE for values more than 1.5 * IQR beyond the quartiles
detect_outlier <- function(x) {
  q1 <- quantile(x, probs = 0.25)
  q3 <- quantile(x, probs = 0.75)
  iqr <- q3 - q1  # lowercase name avoids masking the built-in IQR()
  x > q3 + (iqr * 1.5) | x < q1 - (iqr * 1.5)
}

# Drop flagged rows one column at a time; each later column is
# checked against the rows that survived the earlier columns
remove_outlier <- function(dataframe, columns = names(dataframe)) {
  for (col in columns) {
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
  return(dataframe)
}

remove_outlier(sample_data, c("x", "y", "z"))
Output:
[Figure: sample data containing outliers]
[Figure: boxplot of the sample data]
[Figure: output after outlier removal]
In this article, we learned how to detect and remove outliers in R using visualizations and statistical methods such as the IQR method, giving cleaner data for better analysis and model accuracy.