
Descriptive Statistics

Last Updated : 08 May, 2025

Statistics is the foundation of data science. Descriptive statistics are simple tools that help us understand and summarize data. They show the basic features of a dataset, like the average, highest and lowest values and how spread out the numbers are. It's the first step in making sense of information.

(Figure: Descriptive Statistics)

Types of Descriptive Statistics

Descriptive statistics methods are commonly classified into three categories, each serving a different purpose in summarizing and describing data. They help us understand:

  1. Where the data centers (Measures of Central Tendency)
  2. How spread out the data is (Measure of Variability)
  3. How the data is distributed (Measures of Frequency Distribution)

1. Measures of Central Tendency

Measures of central tendency are statistical values that describe the central position within a dataset. There are three main measures:


Mean: The mean is the sum of all observations divided by the number of observations; it is what is commonly called the average.

\bar{x}=\frac{\sum x}{n}

 where, 

  • x = Observations
  • n = number of terms

Let's look at an example of how we can find the mean of a dataset in Python.

Python
import numpy as np

# Sample Data
arr = [5, 6, 11]

# Mean
mean = np.mean(arr)

print("Mean = ", mean)

Output
Mean =  7.333333333333333

Mode: The most frequently occurring value in the dataset. It’s useful for categorical data and in cases where knowing the most common choice is crucial.

Python
import scipy.stats as stats

# Sample Data
arr = [1, 2, 2, 3]

# Mode (keepdims=False requires scipy >= 1.9)
mode = stats.mode(arr, keepdims=False)
print("Mode = ", mode.mode)

Output: 

Mode =  2

Median: The median is the middle value in a sorted dataset. If the number of values is odd, it's the center value; if even, it's the average of the two middle values. It's often better than the mean for skewed data.

Python
import numpy as np

# sample Data
arr = [1, 2, 3, 4]

# Median
median = np.median(arr)

print("Median = ", median)

Output
Median =  2.5

Note: The implementations above use Python's numpy, scipy and statistics libraries.

Central tendency measures are the foundation for understanding data distribution and identifying anomalies. For example, the mean can reveal trends, while the median highlights skewed distributions.
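As a small sketch with made-up, deliberately skewed data (one large outlier), we can see how the mean gets pulled away from the typical value while the median stays put:

```python
import numpy as np

# Hypothetical skewed data: one large outlier pulls the mean upward
data = [30, 32, 35, 36, 38, 40, 250]

print("Mean   =", np.mean(data))    # inflated by the outlier
print("Median =", np.median(data))  # still close to the typical value
```

The mean comes out near 66 while the median is 36, which is why the median is often preferred for skewed data.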

2. Measure of Variability

Knowing not just where the data centers but also how it spreads out is important. Measures of variability, also called measures of dispersion, describe the spread of observations in a dataset. They help in identifying outliers, assessing model assumptions and understanding data variability in relation to its mean. The key measures of variability include:

1. Range: The range is the difference between the largest and smallest data points in the dataset. The bigger the range, the more spread out the data, and vice versa. While easy to compute, the range is sensitive to outliers. It can provide a quick sense of the data spread but should be complemented with other statistics.

Range = Largest data value - smallest data value 

Python
import numpy as np

# Sample Data
arr = [1, 2, 3, 4, 5]

# Finding Max
Maximum = max(arr)
# Finding Min
Minimum = min(arr)

# Difference Of Max and Min
Range = Maximum-Minimum
print("Maximum = {}, Minimum = {} and Range = {}".format(
    Maximum, Minimum, Range))

Output
Maximum = 5, Minimum = 1 and Range = 4

2. Variance: The variance is the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring these differences, adding them all up and dividing by the number of data points in the dataset.

\sigma ^ 2 = \frac{\sum\left(x-\mu \right )^2}{N}

where,

  • x -> Observation under consideration
  • N -> number of terms 
  • mu -> Mean 
Python
import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# statistics.variance computes the sample variance (divides by n - 1);
# use statistics.pvariance to divide by N as in the formula above
print("Var = ", (statistics.variance(arr)))

Output
Var =  2.5

3. Standard deviation: Standard deviation is widely used to measure the extent of variation or dispersion in data. It's especially important when assessing model performance (e.g., residuals) or comparing datasets with different means.

It is defined as the square root of the variance: subtract the mean from each observation, square the results, average these squared deviations and take the square root of the result.

\sigma = \sqrt{\frac{\sum \left(x-\mu \right )^2}{N}} 

where, 

  • x = Observation under consideration
  • N = number of terms 
  • mu = Mean
Python
import statistics

# Sample data
arr = [1, 2, 3, 4, 5]

# statistics.stdev computes the sample standard deviation (divides by n - 1);
# use statistics.pstdev to divide by N as in the formula above
print("Std = ", (statistics.stdev(arr)))

Output
Std =  1.5811388300841898

Variability measures are important in residual analysis to check how well a model fits the data.
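As a minimal sketch of this idea, assuming some hypothetical actual and predicted values from a model, the standard deviation of the residuals summarizes how far predictions typically miss:

```python
import numpy as np

# Hypothetical actual and predicted values from some model
actual    = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.8, 5.3, 6.9, 9.4])

# Residuals: the part of the data the model did not explain
residuals = actual - predicted

# Sample standard deviation of the residuals (ddof=1 divides by n - 1)
print("Residual std =", np.std(residuals, ddof=1))
```

A smaller residual standard deviation indicates the predictions stay closer to the observed data.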

3. Measures of Frequency Distribution

A frequency distribution table is a powerful way to summarize how data points are distributed across different categories or intervals. It helps identify patterns, outliers and the overall structure of the dataset, and it is often the first step in understanding the data before applying more advanced analytical methods or creating visualizations like histograms or pie charts.

A frequency distribution table includes measures like:

  • Data intervals or categories
  • Frequency counts
  • Relative frequencies (percentages)
  • Cumulative frequencies when needed
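As a rough sketch, assuming a small made-up set of categorical survey responses, a table with these measures can be built using Python's `collections.Counter`:

```python
from collections import Counter

# Hypothetical categorical data: survey responses
data = ["apple", "banana", "apple", "cherry", "banana", "apple"]

counts = Counter(data)
total = sum(counts.values())

# Print frequency, relative frequency and cumulative frequency per category
cumulative = 0
print(f"{'Category':<10}{'Frequency':<11}{'Relative':<10}{'Cumulative'}")
for category, freq in counts.most_common():
    cumulative += freq
    print(f"{category:<10}{freq:<11}{freq / total:<10.2f}{cumulative}")
```

For numeric data, the same idea applies after binning the values into intervals (for example with `numpy.histogram`).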

For Frequency Distribution – Histogram, Bar Graph, Frequency Polygon and Pie Chart read article: Frequency Distribution – Table, Graphs, Formula

