Empirical Distribution Function & Exploratory Data Analysis: Vijay Kumar

Empirical Distribution Function

&
Exploratory Data Analysis

Vijay Kumar

Department of Mathematics and Statistics


DDU Gorakhpur University, Gorakhpur

October, 2022
Empirical Distribution Function Exploratory Data Analysis

Reference:

Tukey, J.W. (1977). Exploratory Data Analysis. Addison Wesley.

V. Kumar, DDU Gorakhpur University B.Sc.-III : EDA October 2022 2 / 22


Empirical Distribution Function Exploratory Data Analysis Order Statistics Definition Example Properties

Order Statistics:

Let X1, X2, ..., Xn denote a random sample from a population with continuous
cdf F(x). Because F is continuous, the probability is zero that any two or
more of these random variables have equal magnitudes. In this situation
there exists a unique ordered arrangement within the sample. Suppose that

X(1) : smallest of (X1, X2, ..., Xn)
X(2) : second smallest of (X1, X2, ..., Xn)
···
X(r) : rth smallest of (X1, X2, ..., Xn)
···
X(n) : largest of (X1, X2, ..., Xn)

Then X(1) < X(2) < ... < X(n) denotes the original random sample after
arrangement in increasing order of magnitude, and these are collectively termed
the order statistics of the random sample X1, X2, ..., Xn. The rth smallest,
1 ≤ r ≤ n, of the ordered X's, X(r), is called the rth order statistic.
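The definition above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are assumptions made for the example, not part of the slides.

```python
def order_statistics(sample):
    """Return the sample arranged in increasing order: [X(1), ..., X(n)]."""
    return sorted(sample)


def rth_order_statistic(sample, r):
    """Return X(r), the rth smallest value, for 1 <= r <= n."""
    n = len(sample)
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    # Python lists are 0-indexed, so X(r) sits at position r - 1.
    return sorted(sample)[r - 1]


sample = [12.0, 9.4, 12.6, 11.2, 11.4]  # illustrative values (assumption)
ordered = order_statistics(sample)
```

Here `rth_order_statistic(sample, 1)` is the minimum X(1) and `rth_order_statistic(sample, len(sample))` is the maximum X(n).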


Order Statistics: Applications


Some familiar applications of order statistics, which are obvious on reflection, are as follows:
X(n) , the maximum (largest) value in the sample, is of interest in the study of floods
and other extreme meteorological phenomena.
X(1) , the minimum (smallest) value, is useful for phenomena where, for example, the
strength of a chain depends on the weakest link.
The sample median, defined as X((n+1)/2) for n odd and any number between X(n/2)
and X(n/2+1) for n even, is a measure of location and an estimate of the population
central tendency.

The sample midrange, defined as (X(1) + X(n))/2, is also a measure of central
tendency.
The sample range X(n) − X(1) is a measure of dispersion.
In some experiments, the sampling process ceases after collecting r of the
observations. For example, in life-testing electric light bulbs, one may start with a
group of n bulbs but stop taking observations after the rth bulb burns out. Then
information is available only on the first r ordered “lifetimes”
X(1) < X(2) < . . . < X(r) , where r ≤ n. This type of data is often referred to as
censored data.
Order statistics are used to study outliers or extreme observations, e.g., when so-called
dirty data are suspected.
Two general uses of order statistics in distribution-free inference are interval estimation
and hypothesis testing of population percentiles. Order statistics also underlie tolerance
limits for distributions.
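Several of the quantities listed above can be computed directly from the ordered sample. The following is a minimal sketch, assuming hypothetical bulb lifetimes in hours; for even n the median is taken as the midpoint of X(n/2) and X(n/2+1), which is one common convention among the admissible values.

```python
def sample_median(xs):
    """Median from the order statistics; midpoint convention for even n."""
    s = sorted(xs)
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]        # X((n+1)/2)
    return (s[n // 2 - 1] + s[n // 2]) / 2  # midpoint of X(n/2) and X(n/2+1)


def midrange(xs):
    """(X(1) + X(n)) / 2."""
    return (min(xs) + max(xs)) / 2


def sample_range(xs):
    """X(n) - X(1)."""
    return max(xs) - min(xs)


def censored_sample(lifetimes, r):
    """First r ordered lifetimes X(1) < ... < X(r): the test stops at the rth failure."""
    return sorted(lifetimes)[:r]


lifetimes = [410, 120, 980, 305, 560]  # hypothetical lifetimes in hours
observed = censored_sample(lifetimes, 3)
```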


The Empirical Distribution Function(EDF):

For a random sample X1, ..., Xn from the distribution F(x), the empirical
distribution function or EDF, denoted by Sn(x), is simply the proportion of
sample values less than or equal to the specified value x, that is,

\[ S_n(x) = \frac{\text{number of sample values} \le x}{n} \]

Sn(x) is a point estimate of F(x) = P(X ≤ x): it is the proportion of sample
points that fall in the interval (−∞, x]. The EDF is most conveniently
defined in terms of the order statistics of a sample.
Suppose that the n sample observations are distinct and arranged in increasing
order so that X(1) is the smallest, X(2) is the second smallest, ..., and X(n)
is the largest.
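As a rough sketch of this definition, the EDF can be built once from the sorted sample and then evaluated at any x by counting how many ordered values are ≤ x. The helper name `make_edf` and the three-point sample are assumptions for illustration.

```python
from bisect import bisect_right


def make_edf(sample):
    """Return a function x -> S_n(x), the proportion of sample values <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def edf(x):
        # bisect_right gives the number of ordered values that are <= x.
        return bisect_right(ordered, x) / n

    return edf


s3 = make_edf([2.0, 5.0, 3.0])  # illustrative sample (assumption)
```

With this sample, `s3(1.9)` is 0, `s3(4.0)` is 2/3 and `s3(5.0)` is 1.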


EDF contd...

The EDF of an observed sample X1, X2, ..., Xn is defined as

\[
S_n(x) =
\begin{cases}
0 & \text{for } x < X_{(1)} \\
\dfrac{i}{n} & \text{for } X_{(i)} \le x < X_{(i+1)};\ i = 1, 2, \ldots, n-1 \\
1 & \text{for } x \ge X_{(n)}
\end{cases}
\]

where X(1) < X(2) < ... < X(n) is the ordered sample.
Clearly, Sn(x) is a step (or jump) function, with jumps occurring at the
(distinct) ordered sample values, where the height of each jump is equal to the
reciprocal of the sample size, i.e. 1/n.
When more than one observation has the same value, we say these observations
are tied. In this case the EDF is still a step function but it jumps only at the
distinct ordered sample values X(j) and the height of the jump is equal to k/n,
where k is the number of data values tied at X(j) .
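The effect of ties on the jump heights can be sketched as follows. The function name and the sample (with a three-way tie at 3) are illustrative assumptions.

```python
from collections import Counter


def edf_jumps(sample):
    """Return (value, jump height) pairs: the EDF jumps by k/n at a value tied k times."""
    n = len(sample)
    counts = Counter(sample)
    return [(value, counts[value] / n) for value in sorted(counts)]


jumps = edf_jumps([3, 1, 3, 2, 3])  # illustrative sample with k = 3 ties at the value 3
```

The jump heights always add up to 1, whether or not ties are present.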


EDF contd...

Suppose that a random sample of size n = 5 is given by

9.4, 11.2, 11.4, 12, and 12.6

The EDF of this sample (plotted in the accompanying figure) is

\[
S_5(x) =
\begin{cases}
0.0 & ;\ x < 9.4 \\
0.2 & ;\ 9.4 \le x < 11.2 \\
0.4 & ;\ 11.2 \le x < 11.4 \\
0.6 & ;\ 11.4 \le x < 12.0 \\
0.8 & ;\ 12.0 \le x < 12.6 \\
1.0 & ;\ x \ge 12.6
\end{cases}
\]
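The step values of S5(x) can be checked numerically with a direct count. This is a small sketch; the evaluation points in `checks` are chosen here for illustration, one inside each interval.

```python
sample = [9.4, 11.2, 11.4, 12, 12.6]
n = len(sample)


def s5(x):
    """Proportion of the five sample values that are <= x."""
    return sum(1 for v in sample if v <= x) / n


# One evaluation point per interval of the piecewise definition (illustrative choice).
checks = [9.0, 9.4, 11.3, 11.4, 12.5, 12.6]
values = [s5(x) for x in checks]
```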


EDF contd...

We now discuss some of the statistical properties of the EDF Sn(x).

0 ≤ Sn(x) ≤ 1, since Sn(x) is a relative frequency: it is the relative
frequency of the event X ≤ x in the sample.
Sn(x) is a step function whose steps (jumps) occur at the order statistics
of the sample.
Sn(x) is a non-decreasing function, since the number of observed values
less than or equal to x does not decrease as x increases.
For all values of x less than the smallest observed value, Sn(x) = 0, and
for all values of x greater than or equal to the largest observed value,
Sn(x) = 1, i.e. Sn(x) = 0 for x < X(1) and Sn(x) = 1 for x ≥ X(n).
The function Sn(x) is called the empirical distribution function (EDF)
because it is determined completely by the observed values of the random
variables and no assumptions about the underlying distribution of X are
necessary.
For each fixed x, Sn(x) converges to F(x) in probability.
Sn(x) is itself a random variable.


EDF contd...

Let Tn(x) = n Sn(x), so that Tn(x) represents the total number of sample
values that are less than or equal to the specified value x.
Theorem: For any fixed real value x, the random variable Tn(x) = n Sn(x) has a
binomial distribution with parameters n and F(x), so that

\[ P\left[S_n(x) = \frac{j}{n}\right] = \binom{n}{j} [F(x)]^{j} [1 - F(x)]^{n-j}; \quad j = 0, 1, \ldots, n \]

Proof: For any fixed real constant x and i = 1, ..., n, define the indicator
random variable

\[
\delta_i(x) =
\begin{cases}
1 & \text{if } X_i \le x \\
0 & \text{otherwise}
\end{cases}
\]

The random variables δ1(x), δ2(x), ..., δn(x) are independent and identically
distributed, each with the Bernoulli distribution with parameter θ, where

θ = P[δi(x) = 1] = P[Xi ≤ x] = F(x)


EDF contd...

Now, we can write

\[ T_n(x) = n S_n(x) = \sum_{i=1}^{n} \delta_i(x) \]

Since it is the sum of n independent and identically distributed Bernoulli
random variables, Tn(x) = n Sn(x) has a binomial distribution with parameters
n and F(x). Therefore

\[ P\left[S_n(x) = \frac{j}{n}\right] = \binom{n}{j} [F(x)]^{j} [1 - F(x)]^{n-j}; \quad j = 0, 1, \ldots, n \]

Using properties of the binomial distribution, we get the following results.

Corollary 1: The mean and the variance of Sn(x) are
(a) E[Sn(x)] = F(x)
(b) Var[Sn(x)] = (1/n) F(x) [1 − F(x)]
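Corollary 1 can be checked numerically by summing over the binomial probabilities. The sketch below assumes hypothetical values n = 5 and F(x) = 0.3 at some fixed x; it computes the exact mean and variance of Sn(x) from its probability mass function rather than by simulation.

```python
from math import comb


def edf_pmf(n, F):
    """P[S_n(x) = j/n] for j = 0..n, where F = F(x) at the fixed point x."""
    return {j / n: comb(n, j) * F**j * (1 - F) ** (n - j) for j in range(n + 1)}


n, F = 5, 0.3  # hypothetical sample size and value of F(x)
pmf = edf_pmf(n, F)

# Mean and variance computed term by term from the pmf.
mean = sum(v * p for v, p in pmf.items())
variance = sum((v - mean) ** 2 * p for v, p in pmf.items())
```

Up to floating-point rounding, `mean` equals F(x) = 0.3 and `variance` equals F(x)[1 − F(x)]/n = 0.042.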


EDF contd...

Corollary 2: For any fixed real value x, Sn(x) is a consistent estimator of
F(x), or, in other words, Sn(x) converges to F(x) in probability.
Since E[Sn(x)] = F(x), Sn(x) is an unbiased estimator of F(x), and

\[ \lim_{n \to \infty} \mathrm{Var}[S_n(x)] = \lim_{n \to \infty} \frac{1}{n} F(x) [1 - F(x)] = 0 \]

Thus, by Chebyshev's inequality, for any ε > 0,

\[ P\left[ |S_n(x) - F(x)| \ge \varepsilon \right] \le \frac{F(x) [1 - F(x)]}{n \varepsilon^{2}} \to 0 \quad \text{as } n \to \infty \]

so Sn(x) is a consistent estimator of F(x).
Remark: The convergence is for each value of x individually, whereas
sometimes we are interested in all values of x collectively. A probability
statement can be made simultaneously for all x, as a result of the following
classical theorem [see Fisz (1963), for example, for a proof].


EDF contd...

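Theorem (Glivenko-Cantelli): Sn(x) converges uniformly to F(x)
with probability 1, that is,

\[ P\left[ \lim_{n \to \infty} \sup_{-\infty < x < \infty} |S_n(x) - F(x)| = 0 \right] = 1 \]

In particular, for any given ε > 0,

\[ \lim_{n \to \infty} P\left[ \sup_{-\infty < x < \infty} |S_n(x) - F(x)| > \varepsilon \right] = 0 \]

Another useful property of the EDF is its asymptotic normality, given in the
following theorem.
Theorem: For any fixed x with 0 < F(x) < 1, as n → ∞ the limiting probability
distribution of the standardized Sn(x) is standard normal, or

\[ \lim_{n \to \infty} P\left[ \frac{\sqrt{n}\,\left(S_n(x) - F(x)\right)}{\sqrt{F(x)(1 - F(x))}} \le t \right] = \Phi(t) \]

Proof: Using the theorem on the binomial distribution of n Sn(x), Corollary 1,
and the central limit theorem, it follows that the distribution of

\[ \frac{n S_n(x) - n F(x)}{\sqrt{n F(x)(1 - F(x))}} = \frac{\sqrt{n}\,\left(S_n(x) - F(x)\right)}{\sqrt{F(x)(1 - F(x))}} \]

approaches the standard normal as n → ∞.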
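The uniform convergence in the Glivenko-Cantelli theorem can be illustrated by simulation. The sketch below assumes a Uniform(0, 1) population, so that F(x) = x, and a fixed seed for reproducibility; both are choices made for this example only. Since Sn is a step function, the supremum of |Sn(x) − F(x)| is attained at (or just before) one of the ordered sample values.

```python
import random


def sup_distance_uniform(n, seed=0):
    """sup_x |S_n(x) - F(x)| for a Uniform(0, 1) sample of size n, where F(x) = x."""
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    # At X(i) the EDF jumps from (i - 1)/n to i/n, so check both sides of each jump.
    return max(max(i / n - x, x - (i - 1) / n) for i, x in enumerate(xs, start=1))


distances = {n: sup_distance_uniform(n) for n in (10, 100, 1000, 10000)}
```

For large n the computed distance is typically small, in line with the theorem; a single simulated sequence is of course only an illustration, not a proof.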



Empirical Distribution Function Exploratory Data Analysis Five Point Summary Quantile function Boxplot Histogram

Exploratory Data Analysis(EDA):

Exploratory data analysis refers to a collection of statistical methods that try
to explore important and interesting features in the data and utilize them in
the process of empirical model building as well as for making useful inference
and prediction.
Exploratory data analysis is an approach to statistical analysis, heavily graphical
in nature, that attempts to maximize insight into data. EDA is a set of methods
to display and summarize the data:
displaying the data in a graph that shows overall patterns and unusual
observations
    boxplot
    histogram and
    density curve etc.
computing descriptive statistics that summarize specific aspects of the
data
    center
    spread
    skewness and
    kurtosis etc.

EDA contd...

It allows data to speak for themselves, without making assumptions and
conducting formal analyses. The descriptive statistical methods quantitatively
describe the main features of data.
The main data features are
    measures of central tendency (e.g., mean and median);
    measures of variability (e.g., standard deviation) and
    measures of relative standing (e.g., quantiles).
It uses simple statistical and graphical procedures that are as assumption-free
as possible. The following graphs are used to perform exploratory data analysis:
    Boxplot
    Histogram and
    Density curve (Kernel density estimate).


Five Point Summary:

The five quantities

Min(x), Q1, Q2, Q3, Max(x)

are often referred to as the five-number summary. Together, these quantities
give a great deal of information about the distribution in terms of the:
    centre
    spread, and
    skewness
Graphically, the five numbers are often displayed as a boxplot.
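A five-number summary can be sketched with Python's standard library. Note the assumption: sample quartiles have several competing definitions, and this example uses the "inclusive" interpolation rule of `statistics.quantiles`, which need not match the rule intended in these slides. The data are the five values from the EDF example.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (Min, Q1, Q2, Q3, Max) using the 'inclusive' quartile rule (assumption)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return (min(data), q1, q2, q3, max(data))


summary = five_number_summary([9.4, 11.2, 11.4, 12, 12.6])
```

For these data the summary is (9.4, 11.2, 11.4, 12.0, 12.6); with n = 5 every quartile coincides with an order statistic.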


Quantile function :
For a continuous distribution F(x), the 100p-th percentile (also referred to as
the pth fractile or quantile), xp, for a given p, 0 < p < 1, is a number such that

\[ P(X \le x_p) = F(x_p) = p, \qquad \text{that is,} \qquad x_p = F^{-1}(p) \]

The quantiles for p = 0.25 and 0.75 are called the first and third quartiles
(Q1 and Q3), and the 0.50 quantile is called the median (Q2):

\[ Q_1 = F^{-1}(0.25), \qquad Q_2 = F^{-1}(0.5), \qquad Q_3 = F^{-1}(0.75) \]

[Figure: the cdf p = F(x), with the quantile function xp = F⁻¹(p) read off at
p = 0.25, 0.50 and 0.75 to give Q1, Q2 and Q3.]
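As a concrete instance of xp = F⁻¹(p), consider a distribution whose cdf can be inverted in closed form. The exponential distribution used below is an assumption chosen for illustration (it does not appear in the slides): F(x) = 1 − exp(−rate·x) for x > 0, so F⁻¹(p) = −ln(1 − p)/rate.

```python
from math import exp, log


def exp_cdf(x, rate=1.0):
    """cdf of the exponential distribution: F(x) = 1 - exp(-rate * x) for x > 0."""
    return 1.0 - exp(-rate * x) if x > 0 else 0.0


def exp_quantile(p, rate=1.0):
    """Quantile function F^{-1}(p) = -ln(1 - p) / rate, for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must satisfy 0 < p < 1")
    return -log(1.0 - p) / rate


# First quartile, median and third quartile of the unit-rate exponential.
q1, q2, q3 = (exp_quantile(p) for p in (0.25, 0.50, 0.75))
```

Applying `exp_cdf` to each quartile returns the corresponding p, which is exactly the defining relation F(xp) = p.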


EDA : Boxplot

A box plot (or "box-and-whisker plot") is an alternative to a histogram that
gives a quick visual display of the main features of a set of data. It
summarises the data and displays these summaries in a box-and-whisker
formation, which makes it a useful summary for one-dimensional data.
The box represents the inter-quartile range (IQR) and shows the median
(line), first quartile (lower edge of box) and third quartile (upper edge of
box) of the distribution.
The box gives an indication of the location and spread of the central portion
of the data, while the extent of the lines (the "whiskers") provides an idea
of the range of the bulk of the data. In some implementations, outliers
(observations that are very different from the rest of the data) are plotted
as separate points.


Boxplot contd...

The basic construction of the box part of the boxplot is as follows:
A horizontal line is drawn at the median.
Split the data into two halves, each containing the median.
Calculate the upper and lower quartiles as the medians of each half, and
draw horizontal lines at each of these values.
Then connect the lines to form a rectangular box.

[Figure: annotated boxplot showing, from top to bottom, outliers,
Q3 + 1.5*IQR (upper whisker), Q3, Q2 (median), Q1, Q1 − 1.5*IQR (lower whisker)
and outliers, with the IQR spanning the box from Q1 to Q3.]
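The construction steps above can be sketched as follows. One point is an interpretation rather than something the slides state: for even n the median is not an observation, so the two halves are simply taken as the lower and upper n/2 values. The data are the five values from the EDF example.

```python
from statistics import median


def quartiles_by_halves(data):
    """Q1, Q2, Q3 with the quartiles taken as medians of the two halves of the data."""
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # for odd n, both halves include the median observation
    lower, upper = xs[:half], xs[n - half:]
    return median(lower), median(xs), median(upper)


q1, q2, q3 = quartiles_by_halves([9.4, 11.2, 11.4, 12, 12.6])
iqr = q3 - q1  # height of the box
```

Here the lower half is 9.4, 11.2, 11.4 and the upper half is 11.4, 12, 12.6, so the box runs from Q1 = 11.2 to Q3 = 12 with the median line at 11.4.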


Boxplot contd...

The box thus drawn defines the interquartile range (IQR). This is the
difference between the upper quartile and the lower quartile. We use the
IQR to give a measure of the amount of variability in the central portion
of the dataset, since about 50% of the data will lie within the box.
The lower whisker is drawn from the middle of the lower end of the box to
the smallest observation that is no smaller than Q1 − 1.5 IQR. Similarly,
the upper whisker is drawn from the middle of the upper end of the box to
the largest observation that is no larger than Q3 + 1.5 IQR.
The rationale for these definitions is that when data are drawn from the
normal distribution or other distributions with a similar shape, about 99%
of the observations (roughly 99.3% for the normal distribution) will fall
between the whiskers.
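The whisker rule and the "about 99%" rationale can be sketched with the standard library. Two assumptions are made here: the quartiles use the "inclusive" rule of `statistics.quantiles` (a different quartile rule would shift the fences slightly), and the ten data values are hypothetical. The last lines compute the probability that a normal observation falls between Q1 − 1.5 IQR and Q3 + 1.5 IQR.

```python
from statistics import NormalDist, quantiles


def whiskers_and_outliers(data):
    """Whisker ends and outliers under the 1.5 * IQR rule ('inclusive' quartiles assumed)."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    # Whiskers end at the most extreme observations that lie within the fences.
    return min(inside), max(inside), outliers


low, high, outliers = whiskers_and_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])  # hypothetical data

# Probability mass of a normal distribution that lies between the two fences.
z = NormalDist()
zq1, zq3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
z_iqr = zq3 - zq1
normal_coverage = z.cdf(zq3 + 1.5 * z_iqr) - z.cdf(zq1 - 1.5 * z_iqr)
```

For the hypothetical data the whiskers end at 1 and 9 and the value 30 is flagged as an outlier; `normal_coverage` comes out close to 0.993.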


Histogram:

Histograms are a useful graphic for displaying univariate data.
A histogram is a special kind of bar chart that shows the frequency
distribution of a numerical variable. The data are broken up into cells
(class intervals) and each cell is displayed as a bar: the base of the bar is
the interval, and the bar represents the count of x values that fall in it.
Histograms can provide insights on skewness, behaviour in the tails, presence
of multi-modal behaviour, and data outliers; histograms can be compared to
the fundamental shapes associated with standard analytic distributions.
Usually all bars have the same width. In this case the height of each bar is
proportional to the number of observations in the corresponding interval. If
bars have different widths, then the area of the bar should be proportional to
the count; in this way the height represents the density (i.e. the frequency
per unit of x). The number of breaks/classes can be defined if required.


Histogram contd...
There are two types of histogram:
Frequency Histogram: We can get a reasonable impression of the
shape of a distribution by drawing a histogram; that is, a count of how
many observations fall within specified divisions ("bins") of the x-axis.
Probability Histogram: The idea of the non-parametric approach is to
avoid restrictive assumptions about the form of the underlying probability
distribution and to estimate it directly from the data. A histogram is a
simple nonparametric estimate of a probability distribution.
In a probability histogram the area of a column is proportional to the number
of observations in it. The y-axis is in density units (that is, proportion of
data per x unit), so that the total area of the histogram will be 1.
With equal bin widths this is really just a change of scale on the y-axis, but
it has the advantage that it becomes possible to overlay the histogram with a
corresponding theoretical density function.
When drawn with a density scale:
    the AREA (NOT height) of each bar is the proportion of observations in the interval
    the TOTAL AREA is 100% (or 1)
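The density scale can be sketched in plain Python. The break points below are hypothetical and deliberately unequal in width, to show that the bar areas, not the heights, carry the proportions. The binning convention (left-closed intervals, with the last interval also closed on the right) is likewise an assumption for this example.

```python
from bisect import bisect_right


def density_histogram(data, breaks):
    """Return (counts, densities) for the bins defined by the sorted break points."""
    k = len(breaks) - 1
    counts = [0] * k
    for x in data:
        if x < breaks[0] or x > breaks[-1]:
            raise ValueError("observation outside the range of the breaks")
        # Bins are [c0, c1), [c1, c2), ..., with the last bin closed on the right.
        i = min(bisect_right(breaks, x) - 1, k - 1)
        counts[i] += 1
    n = len(data)
    widths = [breaks[i + 1] - breaks[i] for i in range(k)]
    # Density = relative frequency per unit of x, so that area = proportion.
    densities = [c / (n * w) for c, w in zip(counts, widths)]
    return counts, densities


data = [9.4, 11.2, 11.4, 12, 12.6]
breaks = [9, 11, 12, 13]  # hypothetical, unequal-width class limits
counts, densities = density_histogram(data, breaks)
widths = [b - a for a, b in zip(breaks, breaks[1:])]
total_area = sum(d * w for d, w in zip(densities, widths))
```

The first bin is twice as wide as the others, so its density (0.1) is a quarter of theirs (0.4) even though its count is only half; the total area is 1.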

Histogram contd...
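C.I.            x      freq    rel. freq.
c0 − c1         x1     f1      f1/n
c1 − c2         x2     f2      f2/n
...             ...    ...     ...
c(i−1) − ci     xi     fi      fi/n
...             ...    ...     ...
c(k−1) − ck     xk     fk      fk/n

[Figure: two histograms of the same data plotted against x. The first is on
the frequency scale, with bars labelled f1, ..., fi−1, fi, fi+1, ..., fk−1, fk;
the second is on the density scale, with the corresponding bars labelled
f1/n, ..., fi/n, ..., fk/n.]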
