
Descriptive Statistics in R

Last Updated : 23 Jul, 2025

Descriptive statistics are techniques used to summarize the key characteristics of a dataset. They are an important first step in data analysis, providing a clear overview before moving to more advanced modeling. Numerical summaries and visualizations together give insight into the data's structure, patterns and any potential issues that may require cleaning.

Quantitative Descriptive Statistics

Quantitative descriptive statistics typically include measures like mean, median, mode, standard deviation and variance. These statistics provide us with key insights like:

  • Mean: Used to find the average of symmetric data when there are no extreme outliers.
  • Median: Used to represent the "typical" value in skewed data or when outliers are present.
  • Mode: Used to find the most frequent value, especially in categorical or discrete data.
  • Variance: Measures how spread out the data is, as the average squared deviation from the mean.
  • Standard Deviation: The square root of the variance; expresses the spread around the mean in the data's own units.
  • First Quartile (Q1): The value below which 25% of the data fall, describing the lower spread.
  • Median (Q2): The central point, dividing the dataset into two halves.
  • Third Quartile (Q3): The value below which 75% of the data fall (the top 25% lie above it), describing the upper spread.
  • Interquartile Range (IQR): The spread between the first and third quartiles (Q3 - Q1), commonly used to detect outliers.
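
The IQR-based outlier check mentioned in the list can be sketched as follows. This is a minimal illustration that assumes the common 1.5 * IQR fence rule, which the text does not specify; the helper name iqr_outliers is hypothetical and not part of base R.

```r
# Hypothetical helper: return the values lying outside the IQR fences.
# The multiplier k = 1.5 is a widely used convention, assumed here.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]            # interquartile range (Q3 - Q1)
  lower <- q[1] - k * spread       # lower fence
  upper <- q[2] + k * spread       # upper fence
  x[!is.na(x) & (x < lower | x > upper)]
}

iqr_outliers(c(1, 2, 3, 4, 100))   # 100 lies above the upper fence
iqr_outliers(iris$Sepal.Length)    # empty result: no sepal length is flagged
```

A different multiplier can be passed through k when a stricter or looser rule is wanted.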

Implementation of Quantitative Descriptive Statistical Analysis in R

We will implement quantitative descriptive statistics using the R programming language. Base R provides functions for nearly all of the statistics we need.

1. Loading the Data

The iris dataset is a built-in dataset in R that contains measurements of the sepal and petal length and width for 150 iris flowers, categorized by species (Setosa, Versicolor and Virginica).

R
data(iris)
df <- iris

head(df)

Output:

[Figure: Loading the Data]

2. Finding Minimum and Maximum Values

We will calculate the minimum and maximum values of the Sepal.Length feature. We can use the built-in min() and max() functions.

R
cat("Minimum Sepal Length:", min(df$Sepal.Length), "\n")

cat("Maximum Sepal Length:", max(df$Sepal.Length), "\n")

Output:

Minimum Sepal Length: 4.3
Maximum Sepal Length: 7.9

3. Calculating Mean, Median and Quartiles

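Base R provides functions named after the measures they compute: mean() and median() for central tendency, quantile() for quartiles and other percentiles, and IQR() for the interquartile range. There is no built-in function for the statistical mode; mode() reports an object's storage type instead.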

R
cat("Mean of Sepal Length:", mean(df$Sepal.Length), "\n")

cat("Median of Sepal Length:", median(df$Sepal.Length), "\n")

cat("1st Quartile of Sepal Length:", quantile(df$Sepal.Length, 0.25), "\n")

cat("3rd Quartile of Sepal Length:", quantile(df$Sepal.Length, 0.75), "\n")

cat("Interquartile Range of Sepal Length:", IQR(df$Sepal.Length), "\n")

Output:

Mean of Sepal Length: 5.843333
Median of Sepal Length: 5.8
1st Quartile of Sepal Length: 5.1
3rd Quartile of Sepal Length: 6.4
Interquartile Range of Sepal Length: 1.3
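
Because base R has no function for the statistical mode, one has to be written by hand. The sketch below is one possible approach; the name stat_mode is hypothetical, and returning every tied value (rather than only the first) is a design choice made for this illustration.

```r
# Hypothetical helper: return the most frequent value(s) of a vector.
stat_mode <- function(x) {
  x <- x[!is.na(x)]                  # ignore missing values
  ux <- unique(x)                    # distinct values, in order of appearance
  counts <- tabulate(match(x, ux))   # how often each distinct value occurs
  ux[counts == max(counts)]          # keep all values tied for the top count
}

stat_mode(c(2, 3, 3, 5))             # 3
stat_mode(c("a", "b", "a"))          # "a"
stat_mode(iris$Sepal.Length)         # most frequent sepal length(s)

mode(iris$Sepal.Length)              # "numeric": the storage mode, not a statistic
```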

4. Standard Deviation and Variance

To measure the spread of data, we can calculate the standard deviation and variance. These metrics give us an idea of how much the values deviate from the mean.

R
cat("Standard Deviation of Sepal Length:", sd(df$Sepal.Length), "\n")

cat("Variance of Sepal Length:", var(df$Sepal.Length), "\n")

Output:

Standard Deviation of Sepal Length: 0.8280661
Variance of Sepal Length: 0.6856935
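
To make the relationship between the two measures concrete, the following sketch recomputes them by hand. It relies on the fact that var() and sd() in R use the sample (n - 1) denominator; the variable names are illustrative only.

```r
x <- iris$Sepal.Length
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation is the square root of the variance
manual_sd <- sqrt(manual_var)

all.equal(manual_var, var(x))   # TRUE
all.equal(manual_sd, sd(x))     # TRUE
```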

5. Five-Number Summary

The fivenum() function returns Tukey's five-number summary: the minimum, lower hinge, median, upper hinge and maximum. The hinges are close to the 1st and 3rd quartiles and, for this data, equal to them.

R
cat("Five-number summary of Sepal Length:\n", fivenum(df$Sepal.Length), "\n")

Output:

Five-number summary of Sepal Length:
4.3 5.1 5.8 6.4 7.9
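
One detail worth knowing is that fivenum() computes Tukey's hinges, whereas quantile() interpolates linearly by default, so the two can disagree on small samples. A short sketch with a made-up four-value vector (toy data, not from the iris dataset):

```r
x <- c(1, 2, 3, 4)             # toy data chosen only for illustration

fivenum(x)                     # 1.0 1.5 2.5 3.5 4.0 (hinges are 1.5 and 3.5)
quantile(x, c(0.25, 0.75))     # 1.75 and 3.25 with the default type = 7
```

For Sepal.Length the two approaches agree, which is why the values 5.1 and 6.4 appear in both outputs.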

6. Summary Function

For a broader overview, the summary() function reports the min, 1st quartile, median, mean, 3rd quartile and max of each numeric column, and the count of each level for factor columns such as Species.

R
cat("Summary of the dataset:\n\n")
summary(df)

Output:

[Figure: Summary Function]
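
When a measure that summary() does not report is needed, such as the standard deviation, a custom table can be built with sapply(). This is a small sketch; the object names num_cols and col_stats are illustrative.

```r
# Keep only the numeric columns (drops the Species factor)
num_cols <- iris[sapply(iris, is.numeric)]

# Apply several statistics to every column; the result is a matrix
# with one row per statistic and one column per variable
col_stats <- sapply(num_cols, function(col) {
  c(mean = mean(col), median = median(col), sd = sd(col))
})

round(col_stats, 3)
```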

7. Grouping by Species

We can also group the dataset by the Species column to get descriptive statistics for each species separately.

R
by(df, df$Species, summary)

Output:

[Figure: Grouping by Species]
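
If only one statistic per group is needed, aggregate() or tapply() gives a more compact result than by(). The sketch below assumes we are interested in Sepal.Length; the object names are illustrative.

```r
# Mean sepal length per species, returned as a data frame
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_means

# Standard deviation per species, returned as a named vector
species_sd <- tapply(iris$Sepal.Length, iris$Species, sd)
species_sd
```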

8. Using Additional Packages for Descriptive Statistics

Several add-on packages also provide functions dedicated to descriptive statistics.

8.1. Pastecs Package

The stat.desc() function in the pastecs package reports a broad set of descriptive measures for each column of a dataset.

R
install.packages("pastecs")
library(pastecs)

stat.desc(df)

Output:

[Figure: Pastecs Package]

8.2. Psych Package

Another example is the describe() function of the psych package, which is similar to the describe() method available in the pandas library of Python.

R
install.packages("psych")
library(psych)

describe(iris)

Output:

[Figure: Psych Package]

Graphical Descriptive Statistical Analysis

Interpreting tables of numbers requires some statistical expertise. Visualizations make the same information easier to read, so they are a standard part of descriptive analysis. Commonly used plots include:

  • Histogram: Visualize data distribution, check for skewness and outliers.
  • Boxplot: Compare distributions, detect outliers and analyze spread.
  • Scatter Plot: Explore relationships or correlations between two variables.
  • Q-Q Plot: Check if data follows a specific distribution (e.g., normality).
  • Line Plot: Visualize trends or changes over time or sequence.
  • Correlation Plot: Assess pairwise correlations between multiple variables.
  • Density Plot: Visualize smooth data distribution and central tendency.

Implementation of Graphical Descriptive Statistical Analysis in R

We will implement graphical descriptive statistics using the R programming language. Base R has functions for nearly all of the plots we need. We will also use the corrplot and ggplot2 packages for some visualizations.

1. Histogram

A histogram shows the distribution of a single variable. It groups the data into bins and the height of each bar represents the number of data points in each bin.

R
hist(df$Sepal.Length,
     main = "Histogram of Sepal Length",
     xlab = "Sepal Length (cm)",
     ylab = "Frequency",
     col = "lightblue", 
     border = "black", 
     breaks = 10)

Output:

[Figure: Histogram]
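
The bins behind a histogram can be inspected without drawing anything by passing plot = FALSE. The sketch below also illustrates that breaks = 10 is treated by hist() as a suggestion, so the actual number of bins may differ.

```r
# Compute the histogram without plotting it
h <- hist(iris$Sepal.Length, breaks = 10, plot = FALSE)

h$breaks        # bin boundaries actually chosen by hist()
h$counts        # number of observations in each bin
sum(h$counts)   # 150, one count per flower
```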

2. Boxplot

A Boxplot shows the spread of data and highlights the median, quartiles and potential outliers.

R
boxplot(Sepal.Length ~ Species, data = df,
        main = "Box Plot of Sepal Length by Species",
        xlab = "Species", 
        ylab = "Sepal Length (cm)",
        col = c("lightblue", "lightgreen", "lightpink"),    
        notch = FALSE,
        horizontal = FALSE)

Output:

[Figure: Boxplot]
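
The numbers a boxplot is drawn from can be retrieved with boxplot.stats(). In this sketch the data is split by species first; the object names by_species and box_stats are illustrative.

```r
# One vector of sepal lengths per species
by_species <- split(iris$Sepal.Length, iris$Species)

# Box statistics for each species
box_stats <- lapply(by_species, boxplot.stats)

# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
sapply(box_stats, function(s) s$stats)

# Points drawn individually as potential outliers
lapply(box_stats, function(s) s$out)
```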

3. Scatter Plot

A scatter plot represents each observation as a point, with the value of one variable on the X-axis and the value of the other on the Y-axis. The pattern of the resulting points can reveal a relationship, such as a correlation, between the two variables.

R
plot(df$Sepal.Length, df$Petal.Length,
     main = "Scatter Plot of Sepal Length vs Petal Length",
     xlab = "Sepal Length (cm)",
     ylab = "Petal Length (cm)",
     pch = 20,          # solid dot as the point symbol
     col = "purple",
     cex = 1.5)         # enlarge the points

Output:

[Figure: Scatter Plot]
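
The visual impression from a scatter plot can be backed by a correlation coefficient. A minimal sketch using cor(), which computes Pearson's coefficient by default; the Spearman variant is shown only as an alternative for monotonic, non-linear relationships.

```r
# Pearson correlation between sepal length and petal length
r <- cor(iris$Sepal.Length, iris$Petal.Length)
r

# Rank-based (Spearman) correlation for comparison
rho <- cor(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
rho
```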

4. QQ Plot

The quantile-quantile (Q-Q) plot helps us judge whether the data follows a normal distribution. It plots the quantiles of the data against the quantiles of a normal distribution; points lying close to the reference line suggest approximate normality.

R
qqnorm(iris$Sepal.Length,
       main = "Q-Q Plot of Sepal Length",
       xlab = "Theoretical Quantiles",
       ylab = "Sample Quantiles",
       col = "darkgreen",         
       pch = 20,              
       cex = 1.5
)

qqline(iris$Sepal.Length, col = "red", lwd = 2)

Output:

[Figure: QQ Plot]
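
A Q-Q plot is a visual check. As a complement, a formal test such as Shapiro-Wilk can be run with shapiro.test() from base R. Using this particular test is an assumption of this sketch, not something the plot requires.

```r
# Shapiro-Wilk test of normality for sepal length
sw <- shapiro.test(iris$Sepal.Length)

sw$statistic   # W statistic; values near 1 are consistent with normality
sw$p.value     # a small p-value is evidence against normality
```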

5. Line Plot

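A line plot is useful for visualizing trends over time or across an ordered sequence. The iris data has no time component, so here the x-axis is simply the row index of each observation.

R
plot(iris$Sepal.Length, type = "o",
     col = "darkred", xlab = "Observation index",
     ylab = "Sepal length",
     main = "Iris sepal length")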

Output:

[Figure: Line Plot]

6. Correlation Plot

A Correlation Plot visualizes the pairwise correlations between multiple variables.

R
install.packages("corrplot")
library(corrplot)

corr_matrix <- cor(data.matrix(df[, sapply(df, is.numeric)]))  

corrplot(corr_matrix, 
         method = "circle",      
         type = "upper",            
         col = colorRampPalette(c("blue", "white", "lightblue"))(200),  
         tl.col = "black",          
         addCoef.col = "black",     
         number.cex = 1,       
         diag = TRUE)  

Output:

[Figure: Correlation Plot]

7. Density Plot

A density plot is a smoothed variation of the histogram. It uses kernel smoothing to estimate a continuous curve from the data, which makes the overall shape of the distribution easier to see.

R
install.packages("ggplot2")

library(ggplot2)

ggplot(iris, aes(x = Sepal.Length)) +
  geom_density(fill = "red", alpha = 0.5) +  
  labs(title = "Density plot of Sepal Length", x = "Sepal Length", y = "Density") +
  theme_minimal()

Output:

[Figure: Density Plot]
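
The kernel smoothing behind a density plot is also available in base R through density(). The sketch below computes the estimate without plotting it, using the function's defaults (a Gaussian kernel and an automatically chosen bandwidth).

```r
# Kernel density estimate of sepal length with default settings
d <- density(iris$Sepal.Length)

d$bw          # bandwidth selected automatically
length(d$x)   # number of grid points the curve is evaluated on (512 by default)

# Area under the estimated curve, approximated on the grid; close to 1
sum(d$y) * diff(d$x[1:2])

# plot(d, main = "Density of Sepal Length") would draw the curve
```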

We can also draw the histogram and the density curve on the same plot.

R
ggplot(iris, aes(x = Sepal.Length)) +
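  # after_stat() and linewidth replace the deprecated ..density.. and size (ggplot2 >= 3.4.0)
  geom_histogram(aes(y = after_stat(density)), fill = "lightblue", color = "black", bins = 30) +
  geom_density(color = "red", linewidth = 1.5) +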
  labs(title = "Iris Sepal Length", x = "Sepal Length", y = "Density") +
  theme_minimal() 

Output:

[Figure: Histogram with density curve]
