Descriptive Statistics in R
Last Updated: 23 Jul, 2025
Descriptive statistics are techniques used to summarize the key characteristics of a dataset. They are an important step in data analysis, providing a clear overview before moving to more advanced modeling and giving insight into the data's structure, patterns and any potential issues that may require cleaning.
Quantitative Descriptive Statistics
Quantitative descriptive statistics typically include measures like mean, median, mode, standard deviation and variance. These statistics provide us with key insights like:
- Mean: Used to find the average of symmetric data when there are no extreme outliers.
- Median: Used to represent the "typical" value in skewed data or when outliers are present.
- Mode: Used to find the most frequent value, especially in categorical or discrete data.
- Variance: Used to understand how spread out the data is, important for modeling and predicting behavior.
- Standard Deviation: Used to understand the consistency or spread of data around the mean.
- First Quartile (Q1): The value below which 25% of the data falls, helping to understand the lower spread.
- Median (Q2): Used as a measure of the central point, dividing the dataset into two halves.
- Third Quartile (Q3): The value below which 75% of the data falls, helping to understand the upper spread.
- Interquartile Range (IQR): Used to detect outliers by measuring the spread between the first and third quartiles.
Implementation of Quantitative Descriptive Statistical Analysis in R
We will implement quantitative descriptive statistics using the R programming language. Base R provides functions for nearly all the statistical values we need.
1. Loading the Data
The iris dataset is a built-in dataset in R that contains measurements of the sepal and petal length and width for 150 iris flowers, categorized by species (Setosa, Versicolor and Virginica).
R
data(iris)
df <- iris
head(df)
Output:
Loading the Data
2. Minimum and Maximum
We will calculate the minimum and maximum values of the Sepal.Length feature using the built-in min() and max() functions.
R
cat("Minimum Sepal Length:", min(df$Sepal.Length), "\n")
cat("Maximum Sepal Length:", max(df$Sepal.Length), "\n")
Output:
Minimum Sepal Length: 4.3
Maximum Sepal Length: 7.9
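Both values can also be obtained in a single call. The following is a minimal sketch using base R's range(), with diff() applied to get the distance between the two; the variable names x, rng and width are illustrative only.

```r
# Sketch: minimum and maximum in one call, plus the width of the range.
x <- iris$Sepal.Length
rng <- range(x)        # vector of length 2: c(min, max)
width <- diff(rng)     # max - min
cat("Range of Sepal Length:", rng, "\n")
cat("Width of the range:", width, "\n")
```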
3. Mean, Median and Quartiles
Base R provides mean(), median(), quantile() and IQR() for measures of central tendency and position. Note that there is no built-in function for the statistical mode: mode() in R returns the storage mode (type) of an object, not the most frequent value.
R
cat("Mean of Sepal Length:", mean(df$Sepal.Length), "\n")
cat("Median of Sepal Length:", median(df$Sepal.Length), "\n")
cat("1st Quartile of Sepal Length:", quantile(df$Sepal.Length, 0.25), "\n")
cat("3rd Quartile of Sepal Length:", quantile(df$Sepal.Length, 0.75), "\n")
cat("Interquartile Range of Sepal Length:", IQR(df$Sepal.Length), "\n")
Output:
Mean of Sepal Length: 5.843333
Median of Sepal Length: 5.8
1st Quartile of Sepal Length: 5.1
3rd Quartile of Sepal Length: 6.4
Interquartile Range of Sepal Length: 1.3
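The statistical mode, listed among the measures earlier, has no dedicated base R function. One possible sketch builds a frequency table with table(); stat_mode is a hypothetical helper name introduced here for illustration, and it returns every value tied for the highest count rather than picking one.

```r
# Sketch: statistical mode via a frequency table.
# stat_mode is a hypothetical helper, not part of base R.
stat_mode <- function(v) {
  counts <- table(v)                            # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]   # all values tied for most frequent
  if (is.numeric(v)) as.numeric(top) else top   # restore numeric type if needed
}

cat("Mode of Sepal Length:", stat_mode(iris$Sepal.Length), "\n")
cat("Mode of Species:", stat_mode(iris$Species), "\n")
```

Because each species has the same number of rows, the second call returns all three species, which shows how ties are handled.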
4. Standard Deviation and Variance
To measure the spread of data, we can calculate the standard deviation and variance. These metrics give us an idea of how much the values deviate from the mean.
R
cat("Standard Deviation of Sepal Length:", sd(df$Sepal.Length), "\n")
cat("Variance of Sepal Length:", var(df$Sepal.Length), "\n")
Output:
Standard Deviation of Sepal Length: 0.8280661
Variance of Sepal Length: 0.6856935
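To make the relationship between the two metrics concrete, the sketch below recomputes them by hand. It relies on the fact that var() and sd() in R use the sample (n - 1) denominator; the names manual_var and manual_sd are illustrative.

```r
# Sketch: sample variance and standard deviation computed manually.
x <- iris$Sepal.Length
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)   # sample variance (n - 1 denominator)
manual_sd <- sqrt(manual_var)                  # standard deviation is its square root
cat("Manual variance:", manual_var, " var():", var(x), "\n")
cat("Manual std. dev.:", manual_sd, " sd():", sd(x), "\n")
```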
5. Five-Number Summary
The fivenum() function returns Tukey's five-number summary: the minimum, lower hinge, median, upper hinge and maximum. The hinges are close to, and for this variable equal to, the 1st and 3rd quartiles.
R
cat("Five-number summary of Sepal Length:\n", fivenum(df$Sepal.Length), "\n")
Output:
Five-number summary of Sepal Length:
4.3 5.1 5.8 6.4 7.9
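The quartiles shown above are also the basis of the common 1.5 × IQR rule for flagging potential outliers, which is the use of the IQR mentioned earlier. The following is a minimal sketch of that rule; the 1.5 multiplier is the conventional choice and is an assumption here, not something the dataset dictates.

```r
# Sketch: flag values outside the 1.5 * IQR fences.
x <- iris$Sepal.Length
q <- quantile(x, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr                       # lower fence
upper <- q[2] + 1.5 * iqr                       # upper fence
outliers <- x[x < lower | x > upper]
cat("Lower fence:", lower, " Upper fence:", upper, "\n")
cat("Number of flagged values:", length(outliers), "\n")
```

For Sepal.Length no value falls outside the fences, since the minimum (4.3) and maximum (7.9) both lie inside them.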
6. Summary Function
For a comprehensive summary, the summary() function provides the min, 1st quartile, median, mean, 3rd quartile and max of each numeric column.
R
cat("Summary of the dataset:\n\n")
summary(df)
Output:
Summary Function
7. Grouping by Species
We can also group the dataset by the Species column to get descriptive statistics for each species separately.
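As a compact illustration of per-group statistics, the sketch below uses tapply() to compute a single statistic for each species. Limiting it to the mean and standard deviation of Sepal.Length is a choice made here for brevity.

```r
# Sketch: one statistic per group with tapply().
means <- tapply(iris$Sepal.Length, iris$Species, mean)
sds <- tapply(iris$Sepal.Length, iris$Species, sd)
# Combine into a small table with one row per species
print(round(cbind(mean = means, sd = sds), 3))
```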
R
by(df, df$Species, summary)
Output:
Grouping by Species
8. Using Additional Packages for Descriptive Statistics
Several add-on packages also provide functions dedicated to computing descriptive statistics for a dataset.
8.1. pastecs Package
For example, the stat.desc() function in the pastecs package reports a broad set of descriptive measures for the columns of the dataset.
R
install.packages("pastecs")
library(pastecs)
stat.desc(df)
Output:
Pastecs Package
8.2. psych Package
Another example is the describe() function of the psych package, which is similar to the describe() method available in the pandas library for Python.
R
install.packages("psych")
library(psych)
describe(iris)
Output:
Psych Package
Graphical Descriptive Statistical Analysis
Interpreting raw numbers requires some statistical expertise. Visualizations make the same descriptive analysis more accessible. Commonly used plots include:
- Histogram: Visualize data distribution, check for skewness and outliers.
- Boxplot: Compare distributions, detect outliers and analyze spread.
- Scatter Plot: Explore relationships or correlations between two variables.
- Q-Q Plot: Check if data follows a specific distribution (e.g., normality).
- Line Plot: Visualize trends or changes over time or sequence.
- Correlation Plot: Assess pairwise correlations between multiple variables.
- Density Plot: Visualize smooth data distribution and central tendency.
Implementation of Graphical Descriptive Statistical Analysis in R
We will implement graphical descriptive statistics using the R programming language. Base R provides nearly all the plotting functions we need; we will also use the corrplot and ggplot2 packages for some visualizations.
1. Histogram
A histogram shows the distribution of a single variable. It groups the data into bins and the height of each bar represents the number of data points in each bin.
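The binning itself can be inspected without drawing anything. As a hedged sketch, hist() accepts plot = FALSE and then returns the bin edges and counts; note that breaks = 10 is treated by R as a suggestion, so the actual number of bins may differ.

```r
# Sketch: inspect histogram bins without plotting.
h <- hist(iris$Sepal.Length, breaks = 10, plot = FALSE)
print(h$breaks)   # bin edges chosen by R
print(h$counts)   # number of observations in each bin
```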
R
hist(df$Sepal.Length,
main = "Histogram of Sepal Length",
xlab = "Sepal Length (cm)",
ylab = "Frequency",
col = "lightblue",
border = "black",
breaks = 10)
Output:
Histogram
2. Boxplot
A boxplot shows the spread of data and highlights the median, quartiles and potential outliers.
R
boxplot(Sepal.Length ~ Species, data = df,
main = "Box Plot of Sepal Length by Species",
xlab = "Species",
ylab = "Sepal Length (cm)",
col = c("lightblue", "lightgreen", "lightpink"),
notch = FALSE,
horizontal = FALSE)
Output:
Boxplot
3. Scatter Plot
A scatter plot represents each observation as a point. The values of two variables are plotted along the X-axis and Y-axis, and the pattern of the resulting points can reveal a relationship between them.
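The strength of a linear relationship seen in a scatter plot can be quantified with cor(). The sketch below uses the default Pearson coefficient for the two variables plotted in this section; other methods (such as "spearman") are available through the method argument.

```r
# Sketch: Pearson correlation between the two plotted variables.
r <- cor(iris$Sepal.Length, iris$Petal.Length)   # method = "pearson" by default
cat("Pearson correlation:", round(r, 3), "\n")
```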
R
plot(df$Sepal.Length, df$Petal.Length,
main = "Scatter Plot of Sepal Length vs Petal Length",
xlab = "Sepal Length (cm)",
ylab = "Petal Length (cm)",
pch = 20,
col = "purple",
cex = 1.5
)
Output:
Scatter Plot
4. Q-Q Plot
The Q-Q (quantile-quantile) plot helps us judge whether the data follows a normal distribution. It compares the quantiles of the data against the quantiles of a normal distribution.
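The quantile pairs behind the plot can also be inspected numerically. As a hedged sketch, qqnorm() accepts plot.it = FALSE and then returns the theoretical (x) and sample (y) quantiles without drawing; the correlation between the two is used here only as a rough, informal indicator of how straight the points are, not as a formal normality test.

```r
# Sketch: extract Q-Q pairs without plotting.
qq <- qqnorm(iris$Sepal.Length, plot.it = FALSE)
print(head(data.frame(theoretical = qq$x, sample = qq$y)))
r_qq <- cor(qq$x, qq$y)   # close to 1 when the points lie near a straight line
cat("Correlation between theoretical and sample quantiles:", round(r_qq, 3), "\n")
```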
R
qqnorm(iris$Sepal.Length,
main = "Q-Q Plot of Sepal Length",
xlab = "Theoretical Quantiles",
ylab = "Sample Quantiles",
col = "darkgreen",
pch = 20,
cex = 1.5
)
qqline(iris$Sepal.Length, col = "red", lwd = 2)
Output:
QQ Plot
5. Line Plot
A line plot is useful for visualizing trends over time or across ordered categories.
R
plot(iris$Sepal.Length, type='o',
col = "darkred", xlab = "Observation index",
ylab = "Sepal length",
main = "Iris sepal length")
Output:
Line Plot
6. Correlation Plot
A Correlation Plot visualizes the pairwise correlations between multiple variables.
R
install.packages("corrplot")
library(corrplot)
corr_matrix <- cor(data.matrix(df[, sapply(df, is.numeric)]))
corrplot(corr_matrix,
method = "circle",
type = "upper",
col = colorRampPalette(c("blue", "white", "red"))(200),
tl.col = "black",
addCoef.col = "black",
number.cex = 1,
diag = TRUE)
Output:
Correlation Plot
7. Density Plot
A density plot is a variation of the histogram that uses kernel smoothing when plotting the values. It is a continuous, smooth version of a histogram estimated from the data.
R
install.packages("ggplot2")
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length)) +
geom_density(fill = "red", alpha = 0.5) +
labs(title = "Density plot of Sepal Length", x = "Sepal Length", y = "Density") +
theme_minimal()
Output:
Density Plot
We can also draw the density curve and the histogram on the same plot.
R
ggplot(iris, aes(x = Sepal.Length)) +
geom_histogram(aes(y = after_stat(density)), fill = "lightblue", color = "black", bins = 30) +
geom_density(color = "red", linewidth = 1.5) +
labs(title = "Iris Sepal Length", x = "Sepal Length", y = "Density") +
theme_minimal()
Output:
Histogram-density-plot
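The kernel smoothing behind these density curves can also be examined with base R's density() function. The sketch below assumes the defaults (a Gaussian kernel and an automatically chosen bandwidth) and checks, with a simple Riemann sum, that the estimated curve integrates to approximately 1.

```r
# Sketch: kernel density estimate computed with base R.
d <- density(iris$Sepal.Length)        # Gaussian kernel, default bandwidth
cat("Bandwidth used:", d$bw, "\n")
# The evaluation grid d$x is equally spaced, so a Riemann sum approximates the area
area <- sum(d$y) * diff(d$x[1:2])
cat("Approximate area under the curve:", round(area, 3), "\n")
```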