scipy.stats is the SciPy sub-package for probability distributions and statistical operations. It provides a wide range of probability functions.
There are three main classes:
| Class | Description |
| --- | --- |
| rv_continuous | Base class for continuous random variables; we can create specialized distribution subclasses and instances from it. |
| rv_discrete | Base class for discrete random variables; we can create specialized distribution subclasses and instances from it. |
| rv_histogram | Generates a distribution described by a histogram. |
Continuous Random Variables
A random variable X is continuous when it can take any value in a range. In scipy.stats, a distribution is shifted with the location (loc) keyword and scaled with the scale keyword; for the normal distribution, loc is the mean and scale is the standard deviation.
As discussed, the rv_continuous class lets us create distribution subclasses and instances. One such instance is norm, the normal distribution, which inherits from rv_continuous and provides methods such as cdf() that calculate the CDF for us.
Let X be a continuous random variable with PDF f and CDF F.
PDF - Probability Density Function
The PDF of a continuous random variable X satisfies the following conditions: f\left ( x \right )\geq 0 for all x\in \mathbb{R}, where f is piecewise continuous, and
\int_{-\infty}^{\infty}f\left ( x \right )dx=1
The probability that X lies in the interval [a, b] is then:
P\left ( a\leq X\leq b \right )=\int_{a}^{b}f\left ( x \right )dx
The CDF is found by integrating the PDF:
F\left ( x \right )=\int_{-\infty}^{x}f\left ( t \right )dt
The PDF can be found by differentiating the CDF:
f\left ( x \right )=\frac{\mathrm{d} }{\mathrm{d} x}\left [ F\left ( x \right ) \right ]
Python3
# Importing the numpy module for numpy array
import numpy as npy
# Importing the scipy.stats.norm
from scipy.stats import norm
# calculating the cdf for the numpy array
print(norm.cdf(npy.array([-2, 0, 2])))
Output:
[0.02275013 0.5 0.97724987]
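To see the PDF-CDF relationship numerically, here is a small sketch (an addition, using only the standard norm.pdf and norm.cdf methods): a central finite difference of the CDF should approximate the PDF.
Python3
# Sketch: the PDF should match the numerical
# derivative of the CDF (central finite difference)
import numpy as npy
from scipy.stats import norm

x = npy.array([-2.0, 0.0, 2.0])
h = 1e-6  # small step for the finite difference
# (F(x + h) - F(x - h)) / (2h) approximates f(x)
print((norm.cdf(x + h) - norm.cdf(x - h)) / (2 * h))
# exact PDF values for comparison, ~[0.054 0.399 0.054]
print(norm.pdf(x))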
Discrete Random Variables
A discrete random variable can take only a countable number of values. Any discrete distribution can be shifted by an additional integer parameter L (the loc keyword). The shifted distribution p and the standard distribution p0 have the following relationship:
p\left ( x \right )=p_{0 }\left ( x-L \right )
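As a quick illustration of this shift, here is a sketch (an addition, using the standard poisson distribution, whose pmf accepts the loc keyword): the pmf shifted by loc = L, evaluated at x, equals the unshifted pmf at x - L.
Python3
# Sketch: a discrete distribution shifted by loc = L
# satisfies p(x) = p0(x - L)
from scipy.stats import poisson

mu, L = 2.0, 3
x = 5
# pmf of the shifted distribution at x ...
print(poisson.pmf(x, mu, loc=L))
# ... equals the unshifted pmf at x - L (~0.2707 for both)
print(poisson.pmf(x - L, mu))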
scipy.stats.circmean
Compute the circular mean for samples in a range. We will use the following function to calculate the circular mean:
Syntax:
scipy.stats.circmean(array, high=2*pi, low=0, axis=None, nan_policy='propagate')
where,
- array - input array of samples.
- high ( float or int ) - high boundary for the sample range. Default: high = 2 * pi.
- low ( float or int ) - low boundary for the sample range. Default: low = 0.
- axis ( int ) - axis along which means are computed.
- nan_policy ( 'propagate', 'raise', 'omit' ) - defines how to handle NaN in the input: 'propagate' returns NaN, 'raise' throws an error, and 'omit' performs the calculations ignoring NaN values. The default is 'propagate'.
Python3
# importing the required package
from scipy.stats import circmean
# calculating the circular mean
print(circmean([0.4, 2.4, 3.6], high=4, low=2))
# | | |
# ------------- ------------ ------------
# sample array higher bound lower bound
Output:
2.254068341376122
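The point of the circular mean is that samples near the two ends of the range wrap around. A small sketch (an addition) with two angles just either side of 0 shows where it differs from the arithmetic mean:
Python3
# Sketch: angles of 359 and 1 degrees should average to
# about 0 degrees on a circle, not to 180 degrees
import numpy as npy
from scipy.stats import circmean

angles = npy.deg2rad([359, 1])
print(circmean(angles))  # ~0 (mod 2*pi), wraps around correctly
print(npy.mean(angles))  # ~pi, the misleading arithmetic mean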
scipy.stats.contingency.crosstab
Given the lists a and p, create a contingency table that counts the frequencies of the corresponding pairs.
Python3
# importing the required package
from scipy.stats.contingency import crosstab
# list a
a = ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B']
# list p
p = ['P', 'P', 'P', 'Q', 'R', 'R', 'Q', 'Q', 'R', 'R']
# result ndarray
print(crosstab(a, p))
# using the crosstab function and extracting
# information such as a's unique values,
# p's unique values and the count of each pair
(auv, puv), cnt = crosstab(a, p)
# printing list a's unique values
print(auv)
# printing list p's unique values
print(puv)
# printing the count object, which gives the count
# of each pair of unique values of a and p
print(cnt)
Output:
((array(['A', 'B'], dtype='<U1'), array(['P', 'Q', 'R'], dtype='<U1')), array([[2, 3, 0],
[1, 0, 4]]))
['A' 'B']
['P' 'Q' 'R']
[[2 3 0]
[1 0 4]]
Note - The output above is a tuple. Its first element, array(['A', 'B'], dtype='<U1'), is the array of unique values in list a; the second element, array(['P', 'Q', 'R'], dtype='<U1'), is the array of unique values in list p; and the final array gives the frequency of each pair from list a and list p.
Reading the count matrix against the unique values:
Row A: A - P = 2, A - Q = 3, A - R = 0
Row B: B - P = 1, B - Q = 0, B - R = 4
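These counts are easy to verify by hand. The sketch below (an addition, using only the standard library) tallies the pairs directly with collections.Counter:
Python3
# Sketch: verify the crosstab counts by tallying pairs directly
from collections import Counter

a = ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B']
p = ['P', 'P', 'P', 'Q', 'R', 'R', 'Q', 'Q', 'R', 'R']
# e.g. ('A', 'Q') appears 3 times - same counts as the matrix above
print(Counter(zip(a, p)))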
scipy.stats.describe
This function computes several descriptive statistics of the given array.
Syntax:
scipy.stats.describe(a, axis=0, ddof=1, bias=True, nan_policy='propagate')
where,
- a ( array_like ) - input array for which the statistics are computed.
- axis ( int ) { # optional } - axis along which statistics are calculated. The default axis is 0.
- ddof ( int ) { # optional } - delta degrees of freedom, used for the variance. Default ddof = 1.
- bias ( bool ) { # optional } - if False, the skewness and kurtosis calculations are corrected for statistical bias.
- nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - defines how to handle NaN inputs.
Return:
- nobs ( int or ndarray ) - number of observations (length of data) along the given axis.
- minmax ( tuple of ndarrays or floats ) - Minimum and Maximum value of input array along the given axis.
- mean ( float or ndarray ) - mean of input array.
- variance ( ndarray or float ) - variance of input array along the given axis.
- skewness ( float or ndarray ) - skewness of input array along the given axis.
- kurtosis ( ndarray or float ) - kurtosis of input array along the given axis.
Python3
# importing the stats and numpy module
from scipy import stats as st
import numpy as npy
# 1D input array
array = npy.array([10, 20, 30, 40, 50, 60, 70, 80])
# calling the describe function
print(st.describe(array))
Output:
DescribeResult(
nobs=8,
minmax=(10, 80),
mean=45.0,
variance=600.0,
skewness=0.0,
kurtosis=-1.2380952380952381)
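Each field of DescribeResult can be reproduced with plain NumPy, which makes a handy sanity check. A minimal sketch (an addition) for the 1D array above:
Python3
# Sketch: reproduce describe()'s fields with plain numpy
import numpy as npy

array = npy.array([10, 20, 30, 40, 50, 60, 70, 80])
print(len(array))                # nobs -> 8
print(array.min(), array.max())  # minmax -> 10 80
print(array.mean())              # mean -> 45.0
print(array.var(ddof=1))         # variance -> 600.0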
Python3
# importing the stats and numpy module
from scipy import stats as st
import numpy as npy
# 2D array
nd = npy.array([[5, 6], [2, 3], [5, 5],\
[7, 9], [9, 8], [8, 7]])
# calling the describe function
print(st.describe(nd))
Output:
DescribeResult(nobs=6,
minmax=(array([2, 3]),
array([9, 9])),
mean=array([6. , 6.33333333]),
variance=array([6.4 , 4.66666667]),
skewness=array([-0.40594941, -0.3380617 ]),
kurtosis=array([-0.9140625, -0.96 ]))
scipy.stats.kurtosis
Kurtosis quantifies how much of a probability distribution's data are concentrated towards the mean as opposed to the tails.
Pearson's kurtosis is the fourth central moment divided by the square of the variance; Fisher's definition subtracts 3 from this so that the normal distribution has kurtosis 0.
Syntax:
scipy.stats.kurtosis(a, axis=0, fisher=True, bias=True, nan_policy='propagate', *, keepdims=False)
where,
- a ( array_like ) - data for which the kurtosis is calculated.
- axis ( int ) { # optional } - axis along which the kurtosis is calculated. The default axis is 0.
- fisher ( bool ) { # optional } - if True, Fisher's definition is used (normal gives 0.0). If False, Pearson's definition is used (normal gives 3.0).
- bias ( bool ) { # optional } - if False, the calculations are corrected for statistical bias.
- nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - defines how to handle NaN inputs.
- keepdims ( bool ) { # optional } - if True, the reduced axes are kept in the result as dimensions with size one, so the result broadcasts correctly against the input array. Default is False.
Returns:
- kurtosis ( array ) - the kurtosis of the values along the given axis.
Python3
# importing the stats module
from scipy import stats as st
# a random dataset of 88 samples drawn from
# the standard normal distribution
dataset = st.norm.rvs(size=88)
# calling the kurtosis function (the printed value
# varies per run because the dataset is random)
print(st.kurtosis(dataset))
Output:
0.04606780907050423
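The fisher flag only shifts the result by 3, since Pearson's kurtosis of the normal distribution is 3 while Fisher's is 0. A small sketch (an addition) making the relationship explicit:
Python3
# Sketch: Fisher's kurtosis is Pearson's kurtosis minus 3
from scipy import stats as st

data = [2, 4, 4, 4, 5, 5, 7, 9]
fisher = st.kurtosis(data, fisher=True)
pearson = st.kurtosis(data, fisher=False)
print(pearson - fisher)  # 3.0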
scipy.stats.mstats.zscore
The z-score tells us how many standard deviations a given value lies from the mean. A data point with a z-score of 0 has the same value as the mean.
Z=\frac{x-\mu }{\sigma }
where x is the observed value, \mu the mean and \sigma the standard deviation.
Calculate the z-score of each value in the input array, relative to the sample mean and standard deviation.
Syntax:
scipy.stats.mstats.zscore(a, axis=0, ddof=0, nan_policy='propagate')
where,
- a ( array_like ) - sample input array.
- axis ( int ) { # optional } - axis along which statistics are calculated. The default axis is 0.
- ddof ( int ) { # optional } - degrees-of-freedom correction in the calculation of the standard deviation. The default value of ddof is 0.
- nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - defines how to handle NaN inputs.
Returns:
- zscore - array - The z-scores of input array a, normalised by mean and standard deviation.
Python3
# importing the stats module
from scipy import stats as st
# the 1D dataset
dataset = [0.02, 0.5, 0.01, 0.33, 0.51, 1.0, 0.03]
# the 2D dataset
nd = [[5.1, 6.1], [2.1, 3.1], [5.1, 5.1],\
      [7.1, 9.1], [9.1, 8.1], [8.1, 7.1]]
# calling the zscore function on the 1D dataset
print(st.zscore(dataset))
# calling the zscore function on the 2D dataset
print(st.zscore(nd))
Output:
[-0.95649434 0.46555034 -0.98612027 -0.03809048 0.49517627 1.94684689
-0.92686841]
[[-0.4330127 -0.16903085]
[-1.73205081 -1.69030851]
[-0.4330127 -0.6761234 ]
[ 0.4330127 1.35224681]
[ 1.29903811 0.84515425]
[ 0.8660254 0.3380617 ]]
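The same numbers fall out of the definition directly. A minimal sketch (an addition) computing (x - mean) / std with NumPy for the 1D dataset:
Python3
# Sketch: z-scores from the definition, (x - mean) / std
import numpy as npy

dataset = npy.array([0.02, 0.5, 0.01, 0.33, 0.51, 1.0, 0.03])
# ddof=0 matches zscore's default (population standard deviation)
print((dataset - dataset.mean()) / dataset.std(ddof=0))
# matches st.zscore(dataset) above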
scipy.stats.skew
We can determine the direction of outliers from skewness. With a positive skew, the tail of the distribution curve is longer on the right side, so the outliers are farther from the mean on the right and closer to it on the left. Skewness only conveys the direction of outliers; it doesn't tell us how many there are.
Compute the sample skewness of a data set. Skewness should be close to zero for normally distributed data. A skewness value greater than zero indicates that the right tail of a unimodal continuous distribution has more weight.
Syntax:
scipy.stats.skew(a, axis=0, bias=True, nan_policy='propagate', *, keepdims=False)
where,
- a ( array_like ) - input array.
- axis ( int ) { # optional } - axis along which the skewness is calculated. The default axis is 0.
- bias ( bool ) { # optional } - if False, the calculations are corrected for statistical bias.
- nan_policy ( 'propagate', 'raise', 'omit' ) { # optional } - defines how to handle NaN inputs.
- keepdims ( bool ) { # optional } - if True, the reduced axes are kept in the result as dimensions with size one, so the result broadcasts correctly against the input array. Default is False.
Return:
- skewness ( float or ndarray ) - the skewness of the values along the given axis.
Python3
# importing the stats module
from scipy import stats as st
# 1D input array
array = [99, 10, 30, 55, 50, 0, 90, 0]
# calling the skew function
print(st.skew(array))
Output:
0.3260023450293658
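For small samples the bias correction can matter. A quick sketch (an addition) comparing the default biased estimate with the corrected one:
Python3
# Sketch: biased vs bias-corrected sample skewness
from scipy import stats as st

array = [99, 10, 30, 55, 50, 0, 90, 0]
print(st.skew(array, bias=True))   # the default, as above
print(st.skew(array, bias=False))  # corrected for statistical bias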
scipy.stats.energy_distance
The energy distance measures the distance between two probability distributions. Suppose u and v are two distributions with CDFs U and V, X and X' are independent random variables distributed according to u, and Y and Y' are independent random variables distributed according to v. Then the energy distance D(u, v) is the square root of:
D^{2}\left ( u,v \right )=2E\left \| X-Y \right \|-E\left \| X-{X}' \right \|-E\left \| Y-{Y}' \right \|\geq 0
where \left \| \cdot \right \| denotes the length of a vector.
Compute the energy distance between two 1D distributions.
Python3
# importing the stats module
from scipy import stats as st
# calling the function: the first two arguments are the
# values of the two distributions, the last two are the
# corresponding weights (u_values, v_values, u_weights, v_weights)
print(st.energy_distance([5, 10], [10, 20],\
                         [20, 30], [30, 40]))
Output:
2.851422845685634
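As a sanity check (an addition), the energy distance between a distribution and itself is 0, and swapping the two distributions gives the same value:
Python3
# Sketch: basic properties of the energy distance
from scipy import stats as st

u, v = [5, 10], [10, 20]
print(st.energy_distance(u, u))  # 0.0 - identical distributions
print(st.energy_distance(u, v))  # same value either way (symmetric)
print(st.energy_distance(v, u))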
scipy.stats.mode
Return the most common value in the input array along the given axis (axis 0 by default), together with its count.
Python3
# importing the stats module
from scipy import stats as st
# sample input array
array = [[2, 3], [3, 1], [1, 3],\
[3, 3], [4, 2], [4, 4],\
[1, 2], [5, 6]]
# calling the mode function
print(st.mode(array))
Output:
ModeResult(mode=array([[1, 3]]), count=array([[2, 3]]))
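For a quick 1D illustration (an addition; note that the exact shape of ModeResult differs across SciPy versions):
Python3
# Sketch: mode of a 1D array - 4 appears most often
from scipy import stats as st

# mode -> 4, count -> 3 (result formatting depends on the SciPy version)
print(st.mode([1, 4, 4, 2, 4, 3]))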
scipy.stats.variation
Compute the coefficient of variation: the standard deviation divided by the mean.
Python3
# importing the stats module
from scipy import stats as st
# sample input array
array = [[2, 3], [3, 1], [1, 3],\
[3, 3], [4, 2], [4, 4],\
[1, 2], [5, 6]]
# calling the function
print(st.variation(array, ddof=1))
Output:
[0.5070393 0.50395263]
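This again reduces to two NumPy calls. A minimal sketch (an addition) for the first column of the array above:
Python3
# Sketch: coefficient of variation from the definition
import numpy as npy

# first column of the array above
col = npy.array([2, 3, 1, 3, 4, 4, 1, 5])
print(npy.std(col, ddof=1) / npy.mean(col))  # ~0.5070, matches variation()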
scipy.stats.rankdata
Assign ranks to data, dealing with ties appropriately.
Python3
# importing the stats module
from scipy import stats as st
# sample input array
array = [2, 3, 15, 1, 6, 9, 8, 4, 5, 10]
# calling the function
print(st.rankdata(array))
Output:
[ 2. 3. 10. 1. 6. 8. 7. 4. 5. 9.]
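The array above has no ties; the method keyword controls how ties are ranked when they do occur. A small sketch (an addition) of the built-in tie-breaking options:
Python3
# Sketch: tie handling in rankdata via the method keyword
from scipy import stats as st

data = [1, 2, 2, 3]
print(st.rankdata(data, method='average'))  # ties share the mean rank: 1, 2.5, 2.5, 4 (default)
print(st.rankdata(data, method='min'))      # ties take the minimum rank: 1, 2, 2, 4
print(st.rankdata(data, method='dense'))    # like 'min' but ranks stay consecutive: 1, 2, 2, 3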