Empirical Distribution Function & Exploratory Data Analysis: Vijay Kumar
October, 2022
Order Statistics:
Let X1, X2, ..., Xn denote a random sample from a population with cdf F(x).
Then X(1) < X(2) < ... < X(n) denotes the original random sample after
arrangement in increasing order of magnitude, and these are collectively termed
the order statistics of the random sample X1, X2, ..., Xn. The rth smallest
of the ordered X's, X(r), 1 ≤ r ≤ n, is called the rth order statistic.
EDF contd...
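In terms of the order statistics, the empirical distribution function (EDF) is
\[
S_n(x) =
\begin{cases}
0, & x < X_{(1)} \\
i/n, & X_{(i)} \le x < X_{(i+1)}, \quad i = 1, 2, \ldots, n-1 \\
1, & x \ge X_{(n)}
\end{cases}
\]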
where X(1) < X(2) < . . . < X(n) is the ordered sample.
Clearly, Sn (x) is a step (or a jump) function, with jumps occurring at the
(distinct) ordered sample values, where the height of each jump is equal to the
reciprocal of the sample size, i.e. 1/n.
When more than one observation has the same value, we say these observations
are tied. In this case the EDF is still a step function but it jumps only at the
distinct ordered sample values X(j) and the height of the jump is equal to k/n,
where k is the number of data values tied at X(j) .
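The definition, including the behaviour at tied values, can be illustrated with a minimal Python sketch. The function name `edf` and the small sample are hypothetical and chosen only for illustration; the sample contains one tie (the value 3 occurs twice), so the jump at 3 should be 2/5 rather than 1/5.

```python
from bisect import bisect_right


def edf(sample, x):
    """Return S_n(x): the proportion of sample values <= x."""
    ordered = sorted(sample)            # order statistics X(1) <= ... <= X(n)
    count = bisect_right(ordered, x)    # T_n(x): number of values <= x
    return count / len(ordered)


# Hypothetical sample with a tie: the value 3 occurs twice (k = 2, n = 5).
sample = [5, 3, 1, 3, 8]
for x in (0, 1, 2.9, 3, 8):
    print(x, edf(sample, x))

# Height of the jump at the tied value 3, expected to be k/n = 2/5.
jump_at_3 = edf(sample, 3) - edf(sample, 2.9)
print("jump at 3:", jump_at_3)
```

Using `bisect_right` on the sorted sample counts the values that are less than or equal to x, which is why ties are absorbed into a single jump.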
EDF contd...
\[
S_5(x) =
\begin{cases}
0.0, & x < 9.4 \\
0.2, & 9.4 \le x < 11.2 \\
0.4, & 11.2 \le x < 11.4 \\
0.6, & 11.4 \le x < 12.0 \\
0.8, & 12.0 \le x < 12.6 \\
1.0, & x \ge 12.6
\end{cases}
\]
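The step heights in this example can be reproduced with a short Python sketch. The sample values below are inferred from the jump points of S5(x) and are therefore an assumption, as is the helper name `edf_steps`.

```python
from collections import Counter


def edf_steps(sample):
    """Return (value, S_n(value)) at each distinct ordered sample value."""
    n = len(sample)
    counts = Counter(sample)
    steps, cumulative = [], 0
    for value in sorted(counts):
        cumulative += counts[value]      # tied values add k/n in one jump
        steps.append((value, cumulative / n))
    return steps


# Sample inferred from the jump points of S_5(x) (an assumption).
sample = [12.0, 9.4, 12.6, 11.2, 11.4]
steps = edf_steps(sample)
for value, height in steps:
    print(f"S_5(x) = {height:.1f} from x = {value} up to the next jump")
```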
Let Tn (x) = n Sn (x), so that Tn (x) represents the total number of sample
values that are less than or equal to the specified value x.
Theorem: For any fixed real value x, the random variable Tn (x) = n Sn (x) has
a binomial distribution with parameters n and F (x); equivalently,
\[
P\left( S_n(x) = \frac{j}{n} \right) = \binom{n}{j} [F(x)]^{j} [1 - F(x)]^{n-j}, \quad j = 0, 1, \ldots, n
\]
Proof: For any fixed real constant x and i = 1, . . . , n, define the indicator
random variable
\[
\delta_i(x) =
\begin{cases}
1, & X_i \le x \\
0, & \text{otherwise}
\end{cases}
\]
The random variables δ1 (x), δ2 (x), . . . , δn (x) are independent and identically
distributed, each with the Bernoulli distribution with parameter θ, where
θ = P (δi (x) = 1) = P (Xi ≤ x) = F (x).
Now, we can write
\[
T_n(x) = n S_n(x) = \sum_{i=1}^{n} \delta_i(x)
\]
Since Tn (x) is the sum of n independent and identically distributed Bernoulli
random variables, Tn (x) = n Sn (x) has a binomial distribution with
parameters n and F (x). Therefore
\[
P\left( S_n(x) = \frac{j}{n} \right) = \binom{n}{j} [F(x)]^{j} [1 - F(x)]^{n-j}, \quad j = 0, 1, \ldots, n
\]
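As a numerical check on the binomial result, the following Python sketch tabulates the distribution of Sn (x) for illustrative values of n and F (x) that are not taken from the text. It also confirms two standard consequences of the theorem: the mean of Sn (x) is F (x) and its variance is F (x)[1 − F (x)]/n. The function name `edf_pmf` is hypothetical.

```python
from math import comb


def edf_pmf(n, F_x):
    """Return {j/n: P(S_n(x) = j/n)} for j = 0..n, given F(x) = F_x."""
    return {
        j / n: comb(n, j) * F_x**j * (1 - F_x) ** (n - j)
        for j in range(n + 1)
    }


n, F_x = 5, 0.3     # illustrative values, not from the text
pmf = edf_pmf(n, F_x)

total = sum(pmf.values())                                    # should be 1
mean = sum(s * p for s, p in pmf.items())                    # should be F(x)
variance = sum((s - mean) ** 2 * p for s, p in pmf.items())  # F(x)(1-F(x))/n
print(total, mean, variance)
```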
EDF contd...
Another useful property of the EDF is its asymptotic normality, given in the
following theorem.
Theorem: As n → ∞, the limiting probability distribution of the standardized
Sn (x) is standard normal, or
\[
\lim_{n \to \infty} P\left( \frac{\sqrt{n}\,[S_n(x) - F(x)]}{\sqrt{F(x)[1 - F(x)]}} \le t \right) = \Phi(t)
\]
where Φ is the standard normal cdf.
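The convergence can be examined without simulation, because n Sn (x) is binomial and the probability on the left-hand side can therefore be computed exactly for any finite n. The sketch below does this for illustrative values of F (x) and t that are not from the text; the helper names are hypothetical, and the binomial terms are evaluated on the log scale to avoid overflow for large n.

```python
from math import exp, floor, lgamma, log, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), with each term computed on the log scale."""
    if k < 0:
        return 0.0
    k = min(k, n)
    log_terms = (
        lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
        + j * log(p) + (n - j) * log(1 - p)
        for j in range(k + 1)
    )
    return sum(exp(lt) for lt in log_terms)


def standardized_edf_cdf(n, F_x, t):
    """Exact P( sqrt(n) (S_n(x) - F(x)) / sqrt(F(x)(1 - F(x))) <= t )."""
    # The event is equivalent to n S_n(x) <= n F(x) + t sqrt(n F(x)(1 - F(x))).
    bound = n * F_x + t * sqrt(n * F_x * (1 - F_x))
    return binom_cdf(floor(bound), n, F_x)


F_x, t = 0.3, 1.0   # illustrative values, not from the text
target = NormalDist().cdf(t)
for n in (10, 100, 1000, 10000):
    print(n, standardized_edf_cdf(n, F_x, t), target)
```

Because Sn (x) is discrete, the exact probabilities approach Φ(t) in small steps rather than smoothly.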
EDA contd...
Boxplot
Histogram and
Density curve (Kernel density estimate).
Five point summary: Min(x), Q1, Q2, Q3, Max(x)
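A five point summary can be computed with Python's standard library, as in the sketch below. The sample is the one implied by the jump points of the S5 (x) example, which is an assumption, and `five_point_summary` is a hypothetical helper name. Sample quartiles have several competing definitions; this sketch simply uses the default ("exclusive") method of `statistics.quantiles`, so other software may report slightly different Q1 and Q3.

```python
from statistics import quantiles


def five_point_summary(data):
    """Return (min, Q1, Q2, Q3, max) of a sample."""
    q1, q2, q3 = quantiles(data, n=4)   # default "exclusive" method
    return min(data), q1, q2, q3, max(data)


# Sample implied by the jump points of the S_5(x) example (an assumption).
sample = [9.4, 11.2, 11.4, 12.0, 12.6]
summary = five_point_summary(sample)
print(summary)
```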
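Quantile function:
For a continuous distribution F (x), the pth quantile (also referred to as
fractile, or as the 100p-th percentile), xp, for a given p, 0 < p < 1, is a
number such that
\[
P(X \le x_p) = F(x_p) = p, \qquad x_p = F^{-1}(p)
\]
The quantiles for p = 0.25 and 0.75 are called the first and third quartiles,
and the quantile for p = 0.5 is the median:
\[
Q_1 = F^{-1}(0.25), \qquad Q_2 = F^{-1}(0.5), \qquad Q_3 = F^{-1}(0.75)
\]
[Figure: the cdf curve p = F(x), with p on the vertical axis (levels 0.25, 0.75
and 1.0 marked) and the quantiles xp = F^{-1}(p), Q1, Q2 and Q3 marked on the
horizontal axis.]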
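The relation between F and its inverse can be checked numerically. The sketch below takes the standard normal distribution as an example (the choice of distribution is an assumption made for illustration) and uses `statistics.NormalDist`, whose `inv_cdf` method plays the role of the quantile function.

```python
from statistics import NormalDist

dist = NormalDist(mu=0.0, sigma=1.0)    # standard normal chosen as an example


def quantile(p):
    """x_p = F^{-1}(p) for 0 < p < 1."""
    return dist.inv_cdf(p)


q1, q2, q3 = quantile(0.25), quantile(0.5), quantile(0.75)
for p, x_p in ((0.25, q1), (0.5, q2), (0.75, q3)):
    # Applying F to the quantile x_p recovers p.
    print(p, x_p, dist.cdf(x_p))
```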
EDA : Boxplot
The box thus drawn defines the interquartile range (IQR). This is the
difference between the upper quartile and the lower quartile. We use the
IQR to give a measure of the amount of variability in the central portion
of the dataset, since about 50% of the data will lie within the box.
The lower whisker is drawn from the lower end of the box to the smallest
value that is no smaller than 1.5 × IQR below the lower quartile, i.e. the
smallest observation ≥ Q1 − 1.5 × IQR. Similarly, the upper whisker is drawn
from the upper end of the box to the largest value that is no larger than
1.5 × IQR above the upper quartile, i.e. the largest observation
≤ Q3 + 1.5 × IQR.
The rationale for these definitions is that when data are drawn from the
normal distribution or other distributions with a similar shape, about 99%
of the observations will fall between the whiskers.
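The whisker rule and the "about 99%" figure can both be sketched in Python. The data values are made up for illustration, the helper names are hypothetical, and the quartiles come from the "inclusive" method of `statistics.quantiles`, which is only one of several conventions used by plotting software. The second function computes, for a normal distribution, the probability of falling between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR.

```python
from statistics import NormalDist, quantiles


def whiskers_and_outliers(data):
    """Return ((lower_whisker, upper_whisker), outliers) by the 1.5 IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    # Whiskers end at the most extreme observations inside the fences.
    return (min(inside), max(inside)), outliers


def normal_coverage():
    """Probability that a normal observation lies within the 1.5 IQR fences."""
    z = NormalDist()
    q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
    iqr = q3 - q1
    return z.cdf(q3 + 1.5 * iqr) - z.cdf(q1 - 1.5 * iqr)


data = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.3, 5.9, 6.4, 12.7]  # made-up values
ends, outliers = whiskers_and_outliers(data)
coverage = normal_coverage()
print(ends, outliers, coverage)
```

Observations outside the fences are the ones a boxplot usually draws as individual points.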
Histogram:
There are two types of histogram:
Frequency Histogram: We can get a reasonable impression of the
shape of a distribution by drawing a histogram; that is, a count of how
many observations fall within specified divisions ("bins") of the x-axis.
Probability Histogram: The idea of the nonparametric approach is to
avoid restrictive assumptions about the form of the density f(x) and to
estimate it directly from the data. A histogram is a simple nonparametric
estimate of a probability distribution.
Notice that we automatically got the "correct" histogram, where the area of a
column is proportional to the number of observations in it. The y-axis is in
density units (that is, proportion of data per x unit), so that the total area
of the histogram will be 1.
This is really just a change of scale on the y-axis, but it has the advantage that
it becomes possible to overlay the histogram with a corresponding theoretical
density function.
When drawn with a density scale, the AREA (NOT height) of each bar is the
proportion of observations in the interval, and the TOTAL AREA is 100% (or 1).
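The density scale can be made concrete with a small Python sketch. The data, the bin edges and the function name `density_histogram` are all made up for illustration, and bins are assumed to be closed on the left (with the last bin also closed on the right). Unequal bin widths are used deliberately, since that is where height and area differ most visibly.

```python
def density_histogram(data, edges):
    """Return (counts, densities) for bins [edges[i], edges[i+1])."""
    n = len(data)
    counts = [0] * (len(edges) - 1)
    for v in data:
        for i in range(len(counts)):
            last = i == len(counts) - 1
            # The last bin is closed on the right so the maximum is counted.
            if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                counts[i] += 1
                break
    widths = [edges[i + 1] - edges[i] for i in range(len(counts))]
    # Bar height = proportion of observations per unit of x.
    densities = [c / (n * w) for c, w in zip(counts, widths)]
    return counts, densities


data = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.6, 5.0, 7.5]   # made-up values
edges = [1, 2, 3, 4, 6, 8]                                   # unequal widths
counts, densities = density_histogram(data, edges)

# Total area = sum of (height x width) over all bars.
area = sum(d * (edges[i + 1] - edges[i]) for i, d in enumerate(densities))
print(counts, densities, area)
```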
V. Kumar, DDU Gorakhpur University, B.Sc.-III: EDA, October 2022
C.I.             x     freq   rel. freq.
c0 − c1          x1    f1     f1/n
c1 − c2          x2    f2     f2/n
...              ...   ...    ...
c(i−1) − ci      xi    fi     fi/n
...              ...   ...    ...
c(k−1) − ck      xk    fk     fk/n

[Figure: two histograms plotted against the class mid-values xi. One panel has
y-axis "Frequency" (ticks 0, 10, 20, 30) with bars labelled f1, ..., fk; the
other has y-axis "Density" with bars labelled f1/n, ..., fk/n.]
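A table of this form can be built from raw data with a few lines of Python. In the sketch below the data and class limits are made up, `frequency_table` is a hypothetical helper name, and each class interval is assumed to include its lower limit but not its upper limit (except the last interval, which includes both).

```python
def frequency_table(data, edges):
    """Rows of (class interval, mid-value x_i, f_i, f_i / n)."""
    n = len(data)
    rows = []
    for i in range(len(edges) - 1):
        lo, hi = edges[i], edges[i + 1]
        last = i == len(edges) - 2
        # Intervals are [c_{i-1}, c_i); the last one also includes c_k.
        f = sum(1 for v in data if lo <= v < hi or (last and v == hi))
        rows.append(((lo, hi), (lo + hi) / 2, f, f / n))
    return rows


data = [12, 15, 21, 22, 27, 31, 33, 38, 40, 47]   # made-up values
edges = [10, 20, 30, 40, 50]                      # c_0, c_1, ..., c_k
table = frequency_table(data, edges)

print("C.I.      x     freq  rel.freq.")
for (lo, hi), mid, f, rel in table:
    print(f"{lo}-{hi}   {mid:5.1f}  {f:4d}  {rel:8.2f}")
```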