
Probability Distributions

A Basic Reference

PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Tue, 18 Sep 2012 20:05:06 UTC

Contents

Articles

Probability distribution

Discrete distributions

Bernoulli distribution
Binomial distribution
Uniform distribution (discrete)
Poisson distribution
Beta-binomial distribution
Negative binomial distribution
Geometric distribution
Multinomial distribution
Categorical distribution
Dirichlet distribution

Continuous Distributions on [a,b]

Uniform distribution (continuous)
Beta distribution

Continuous Distributions on (-Inf, Inf)

Normal distribution
Student's t-distribution

Continuous Distributions on [0,Inf)

Gamma distribution
Pareto distribution
Inverse-gamma distribution
Chi-squared distribution
F-distribution
Log-normal distribution
Exponential distribution

Multivariate Continuous Distributions

Multivariate normal distribution
Wishart distribution

References

Article Sources and Contributors
Image Sources, Licenses and Contributors

Article Licenses

License

Probability distribution
In probability and statistics, a probability distribution assigns a probability to each of the possible outcomes of a random experiment. Examples are found in experiments whose sample space is non-numerical, where the distribution would be a categorical distribution; experiments whose sample space is encoded by discrete random variables, where the distribution is a probability mass function; and experiments with sample spaces encoded by continuous random variables, where the distribution is a probability density function. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures.

In applied probability, a probability distribution can be specified in a number of different ways, often chosen for mathematical convenience:

by supplying a valid probability mass function or probability density function
by supplying a valid cumulative distribution function or survival function
by supplying a valid hazard function
by supplying a valid characteristic function
by supplying a rule for constructing a new random variable from other random variables whose joint probability distribution is known.

Important and commonly encountered probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution.

Introduction
To define probability distributions for the simplest cases, one needs to distinguish between discrete and continuous random variables. In the discrete case, one can easily assign a probability to each possible value: when throwing a die, each of the six values 1 to 6 has the probability 1/6. In contrast, when a random variable takes values from a continuum, probabilities are nonzero only if they refer to finite intervals: in quality control one might demand that the probability of a "500g" package containing between 490g and 510g should be no less than 98%.
Discrete probability distribution for the sum of two dice.


If the random variable is real-valued (or more generally, if a total order is defined for its possible values), the cumulative distribution function gives the probability that the random variable is no larger than a given value; in the real-valued case it is the integral of the density.

Terminology

Normal distribution, also called Gaussian or "bell curve", the most important continuous random distribution.

As probability theory is used in quite diverse applications, terminology is not uniform and sometimes confusing. The following terms are used for non-cumulative probability distribution functions:

Probability mass, probability mass function, p.m.f.: for discrete random variables.
Categorical distribution: for discrete random variables with a finite set of values.
Probability density, probability density function, p.d.f.: most often reserved for continuous random variables.

The following terms are somewhat ambiguous, as they can refer to non-cumulative or cumulative distributions, depending on authors' preferences:

Probability distribution function: continuous or discrete, non-cumulative or cumulative.
Probability function: even more ambiguous; can mean any of the above, or anything else.

Finally,

Probability distribution: either the same as probability distribution function, or understood as something more fundamental underlying an actual mass or density function.

Basic terms
Mode: the most frequently occurring value in a distribution.
Tail: the region of least frequently occurring values in a distribution.
Support: the smallest closed interval/set whose complement has probability zero. It may be understood as the points or elements that are actual members of the distribution.

Discrete probability distribution


A discrete probability distribution shall be understood as a probability distribution characterized by a probability mass function. Thus, the distribution of a random variable X is discrete, and X is then called a discrete random variable, if

    Σ_u Pr(X = u) = 1      (1)

as u runs through the set of all possible values of X. It follows that such a random variable can assume only a finite or countably infinite number of values.

In cases more frequently considered, this set of possible values is a topologically discrete set in the sense that all its points are isolated points. But there are discrete random variables for which this countable set is dense on the real line (for example, a distribution over rational numbers).

The probability mass function of a discrete probability distribution. The probabilities of the singletons {1}, {3}, and {7} are respectively 0.2, 0.5, 0.3. A set not containing any of these points has probability zero.
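As a concrete sketch of such a distribution, the pmf from the caption above can be represented directly and checked against condition (1). This is a minimal illustration in Python; the dictionary representation is just an illustrative choice.

    # The pmf from the example: Pr{1} = 0.2, Pr{3} = 0.5, Pr{7} = 0.3.
    pmf = {1: 0.2, 3: 0.5, 7: 0.3}

    def prob_of_set(points):
        # The probability of a set is the sum of the pmf over its members.
        return sum(pmf.get(x, 0.0) for x in points)

    print(sum(pmf.values()))        # 1.0, as required by (1)
    print(prob_of_set({1, 3}))      # 0.7
    print(prob_of_set({2, 4, 6}))   # 0.0: a set containing none of the support points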


Among the most well-known discrete probability distributions that are used for statistical modeling are the Poisson distribution, the Bernoulli distribution, the binomial distribution, the geometric distribution, and the negative binomial distribution. In addition, the discrete uniform distribution is commonly used in computer programs that make equal-probability random selections between a number of choices.

Figure captions: the cdf of a discrete probability distribution; of a continuous probability distribution; and of a distribution which has both a continuous part and a discrete part.

Cumulative density
Equivalently to the above, a discrete random variable can be defined as a random variable whose cumulative distribution function (cdf) increases only by jump discontinuities; that is, its cdf increases only where it "jumps" to a higher value, and is constant between those jumps. The points where jumps occur are precisely the values which the random variable may take. The number of such jumps may be finite or countably infinite. The set of locations of such jumps need not be topologically discrete; for example, the cdf might jump at each rational number.


Delta-function representation
Consequently, a discrete probability distribution is often represented as a generalized probability density function involving Dirac delta functions, which substantially unifies the treatment of continuous and discrete distributions. This is especially useful when dealing with probability distributions involving both a continuous and a discrete part.

Indicator-function representation
For a discrete random variable X, let u0, u1, ... be the values it can take with non-zero probability. Denote

    Ω_i = {ω : X(ω) = u_i},   i = 0, 1, 2, ...

These are disjoint sets, and by formula (1)

    Pr(∪_i Ω_i) = Σ_i Pr(Ω_i) = Σ_i Pr(X = u_i) = 1.

It follows that the probability that X takes any value except for u0, u1, ... is zero, and thus one can write X as

    X(ω) = Σ_i u_i 1_{Ω_i}(ω)

except on a set of probability zero, where 1_A is the indicator function of A. This may serve as an alternative definition of discrete random variables.

Continuous probability distribution


A continuous probability distribution is a probability distribution that has a probability density function. Mathematicians also call such a distribution absolutely continuous, since its cumulative distribution function is absolutely continuous with respect to the Lebesgue measure. If the distribution of X is continuous, then X is called a continuous random variable. There are many examples of continuous probability distributions: normal, uniform, chi-squared, and others.

Intuitively, a continuous random variable is one which can take a continuous range of values, as opposed to a discrete distribution, where the set of possible values for the random variable is at most countable. While for a discrete distribution an event with probability zero is impossible (e.g. rolling 3½ on a standard die is impossible, and has probability zero), this is not so in the case of a continuous random variable.

For example, if one measures the width of an oak leaf, the result of 3½ cm is possible; however, it has probability zero, because there are uncountably many other potential values even between 3 cm and 4 cm. Each of these individual outcomes has probability zero, yet the probability that the outcome will fall into the interval (3 cm, 4 cm) is nonzero. This apparent paradox is resolved by the fact that the probability that X attains some value within an infinite set, such as an interval, cannot be found by naively adding the probabilities for individual values. Formally, each value has an infinitesimally small probability, which statistically is equivalent to zero.

Formally, if X is a continuous random variable, then it has a probability density function f(x), and therefore its probability of falling into a given interval, say [a, b], is given by the integral

    Pr[a ≤ X ≤ b] = ∫_a^b f(x) dx.

In particular, the probability for X to take any single value a (that is, a ≤ X ≤ a) is zero, because an integral with coinciding upper and lower limits is always equal to zero.

The definition states that a continuous probability distribution must possess a density, or equivalently, that its cumulative distribution function be absolutely continuous. This requirement is stronger than simple continuity of the cdf, and there is a special class of distributions, singular distributions, which are neither continuous nor discrete nor a mixture of the two. An example is given by the Cantor distribution. Such singular distributions, however, are never encountered in practice.

Note on terminology: some authors use the term "continuous distribution" to denote distributions with a continuous cdf. Thus, their definition includes both the (absolutely) continuous and singular distributions.

By one convention, a probability distribution μ is called continuous if its cumulative distribution function F(x) = μ(−∞, x] is continuous and, therefore, the probability measure of singletons μ{x} = 0 for all x.

Another convention reserves the term continuous probability distribution for absolutely continuous distributions. These distributions can be characterized by a probability density function: a non-negative Lebesgue-integrable function f defined on the real numbers such that

    F(x) = μ(−∞, x] = ∫_{−∞}^{x} f(t) dt.

Discrete distributions and some continuous distributions (like the Cantor distribution) do not admit such a density.

To better understand continuous distributions, consider an example used by Kevin Gue (Tim Cook Associate Professor) in his Stochastic Operations lectures at Auburn University. Assume you are playing golf. What is the probability that you can hit the golf ball exactly, on the dot, 200 yards? Answer: 0. This is not directly intuitive; one wants to think that there must be some small probability of making the ball stop at exactly 200 yards. But because distance is evaluated continuously, there are infinitely many points at which the ball could stop (for example, 199.2304930234930 yards). Since the possibilities are infinite, the probability of hitting exactly 200 yards is zero.
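A minimal numerical sketch of this point, assuming (purely for illustration) that the carry distance is modeled as normal with mean 200 and standard deviation 10 yards: an interval of distances has positive probability, while any exact value has probability zero.

    import math

    def normal_cdf(x, mu, sigma):
        # Normal cumulative distribution function via the error function.
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    mu, sigma = 200.0, 10.0   # assumed model of the carry distance, in yards
    # Probability of landing within half a yard of 200 is positive...
    p_interval = normal_cdf(200.5, mu, sigma) - normal_cdf(199.5, mu, sigma)
    # ...but the probability of any exact value is an integral over a width-0 interval.
    p_point = normal_cdf(200.0, mu, sigma) - normal_cdf(200.0, mu, sigma)
    print(p_interval)   # about 0.0399
    print(p_point)      # exactly 0.0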


Probability distributions of scalar random variables


The following applies to all types of scalar random variables. Because a probability distribution Pr on the real line is determined by the probability of a scalar random variable X being in a half-open interval (−∞, x], the probability distribution is completely characterized by its cumulative distribution function:

    F(x) = Pr[X ≤ x]   for all x.

Some properties
The probability distribution of the sum of two independent random variables is the convolution of each of their distributions.

Probability distributions are not a vector space (they are not closed under linear combinations, as these do not preserve non-negativity or total integral 1), but they are closed under convex combination, thus forming a convex subset of the space of functions (or measures).
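As a small illustration of the convolution property (and of the two-dice distribution pictured earlier), here is a minimal Python sketch; representing a discrete distribution as a dictionary is just an illustrative choice.

    # A discrete distribution as a dict {value: probability}.
    die = {k: 1.0 / 6.0 for k in range(1, 7)}

    def convolve(p, q):
        # Distribution of X + Y for independent X ~ p and Y ~ q.
        out = {}
        for x, px in p.items():
            for y, qy in q.items():
                out[x + y] = out.get(x + y, 0.0) + px * qy
        return out

    two_dice = convolve(die, die)
    print(two_dice[7])              # 6/36, the most likely sum
    print(sum(two_dice.values()))   # 1.0 (up to rounding)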

Kolmogorov definition
In the measure-theoretic formalization of probability theory, a random variable is defined as a measurable function X from a probability space (Ω, F, P) to a measurable space (X, A). A probability distribution is the pushforward measure X∗P = P X⁻¹ on (X, A).

Random number generation


A frequent problem in statistical simulations (the Monte Carlo method) is the generation of pseudo-random numbers that are distributed in a given way. Most algorithms are based on a pseudorandom number generator that produces numbers X that are uniformly distributed in the interval [0,1). These random variates X are then transformed via some algorithm to create a new random variate having the required probability distribution.
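One common such transformation is inverse transform sampling: apply the inverse cumulative distribution function of the target distribution to each uniform variate. A minimal sketch, using the exponential distribution (whose inverse cdf has a closed form) as the illustrative target:

    import math
    import random

    def sample_exponential(rate):
        # If U ~ Uniform[0,1), then -ln(1 - U) / rate is exponential with the given rate.
        u = random.random()
        return -math.log(1.0 - u) / rate

    samples = [sample_exponential(rate=2.0) for _ in range(100000)]
    print(sum(samples) / len(samples))   # close to the theoretical mean 1/rate = 0.5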

Applications
The concept of the probability distribution and of the random variables it describes underlies the mathematical discipline of probability theory and the science of statistics. There is spread or variability in almost any value that can be measured in a population (e.g. height of people, durability of a metal, sales growth, traffic flow, etc.); almost all measurements are made with some intrinsic error; in physics many processes are described probabilistically, from the kinetic properties of gases to the quantum mechanical description of fundamental particles. For these and many other reasons, simple numbers are often inadequate for describing a quantity, while probability distributions are often more appropriate.

As a more specific example of an application, the cache language models and other statistical language models used in natural language processing assign probabilities to the occurrence of particular words and word sequences by means of probability distributions.


Common probability distributions


The following is a list of some of the most common probability distributions, grouped by the type of process that they are related to. For a more complete list, see list of probability distributions, which groups by the nature of the outcome being considered (discrete, continuous, multivariate, etc.) Note also that all of the univariate distributions below are singly peaked; that is, it is assumed that the values cluster around a single point. In practice, actually observed quantities may cluster around multiple values. Such quantities can be modeled using a mixture distribution.

Related to real-valued quantities that grow linearly (e.g. errors, offsets)


Normal distribution (Gaussian distribution), for a single such quantity; the most common continuous distribution

Related to positive real-valued quantities that grow exponentially (e.g. prices, incomes, populations)
Log-normal distribution, for a single such quantity whose log is normally distributed
Pareto distribution, for a single such quantity whose log is exponentially distributed; the prototypical power law distribution

Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region
Discrete uniform distribution, for a finite set of values (e.g. the outcome of a fair die)
Continuous uniform distribution, for continuously distributed values

Related to Bernoulli trials (yes/no events, with a given probability)


Basic distributions:
  Bernoulli distribution, for the outcome of a single Bernoulli trial (e.g. success/failure, yes/no)
  Binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of independent occurrences
  Negative binomial distribution, for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
  Geometric distribution, for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special case of the negative binomial distribution
Related to sampling schemes over a finite population:
  Hypergeometric distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, using sampling without replacement
  Beta-binomial distribution, for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed number of total occurrences, sampling using a Pólya urn scheme (in some sense, the "opposite" of sampling without replacement)


Related to categorical outcomes (events with K possible outcomes, with a given probability for each outcome)
Categorical distribution, for a single categorical outcome (e.g. yes/no/maybe in a survey); a generalization of the Bernoulli distribution
Multinomial distribution, for the number of each type of categorical outcome, given a fixed number of total outcomes; a generalization of the binomial distribution
Multivariate hypergeometric distribution, similar to the multinomial distribution, but using sampling without replacement; a generalization of the hypergeometric distribution

Related to events in a Poisson process (events that occur independently with a given rate)
Poisson distribution, for the number of occurrences of a Poisson-type event in a given period of time
Exponential distribution, for the time before the next Poisson-type event occurs

Useful for hypothesis testing related to normally distributed outcomes


Chi-squared distribution, the distribution of a sum of squared standard normal variables; useful e.g. for inference regarding the sample variance of normally distributed samples (see chi-squared test)
Student's t distribution, the distribution of the ratio of a standard normal variable and the square root of a scaled chi-squared variable; useful for inference regarding the mean of normally distributed samples with unknown variance (see Student's t-test)
F-distribution, the distribution of the ratio of two scaled chi-squared variables; useful e.g. for inferences that involve comparing variances or involving R-squared (the squared correlation coefficient)

Useful as conjugate prior distributions in Bayesian inference


Beta distribution, for a single probability (real number between 0 and 1); conjugate to the Bernoulli distribution and binomial distribution
Gamma distribution, for a non-negative scaling parameter; conjugate to the rate parameter of a Poisson distribution or exponential distribution, the precision (inverse variance) of a normal distribution, etc.
Dirichlet distribution, for a vector of probabilities that must sum to 1; conjugate to the categorical distribution and multinomial distribution; generalization of the beta distribution
Wishart distribution, for a symmetric non-negative definite matrix; conjugate to the inverse of the covariance matrix of a multivariate normal distribution; generalization of the gamma distribution

References
B. S. Everitt: The Cambridge Dictionary of Statistics, Cambridge University Press, Cambridge (3rd edition, 2006). ISBN 0-521-69027-7 Bishop: Pattern Recognition and Machine Learning, Springer, ISBN 0-387-31073-8

External links
Hazewinkel, Michiel, ed. (2001), "Probability distribution" (http://www.encyclopediaofmath.org/index.php?title=p/p074900), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4

Discrete distributions
Bernoulli distribution
Bernoulli

Parameters: 0 ≤ p ≤ 1 (success probability); q = 1 − p
Support: k ∈ {0, 1}
PMF: q for k = 0; p for k = 1
CDF: 0 for k < 0; q for 0 ≤ k < 1; 1 for k ≥ 1
Mean: p
Median: 0 if p < 1/2; [0, 1] if p = 1/2; 1 if p > 1/2
Mode: 0 if p < 1/2; 0 and 1 if p = 1/2; 1 if p > 1/2
Variance: p(1 − p) = pq
Skewness: (1 − 2p)/√(pq)
Ex. kurtosis: (1 − 6pq)/(pq)
Entropy: −q ln q − p ln p
MGF: q + p e^t
CF: q + p e^{it}
PGF: q + p z

In probability theory and statistics, the Bernoulli distribution, named after Swiss scientist Jacob Bernoulli, is a discrete probability distribution which takes value 1 with success probability p and value 0 with failure probability q = 1 − p. So if X is a random variable with this distribution, we have:

    Pr(X = 1) = 1 − Pr(X = 0) = 1 − q = p.

A classical example of a Bernoulli experiment is a single toss of a coin. The coin might come up heads with probability p and tails with probability 1 − p. The experiment is called fair if p = 0.5, indicating the origin of the terminology in betting (the bet is fair if both possible outcomes have the same probability). The probability mass function f of this distribution is

    f(k; p) = p if k = 1,   and   f(k; p) = 1 − p if k = 0.

This can also be expressed as

    f(k; p) = p^k (1 − p)^{1−k}   for k ∈ {0, 1}.

The expected value of a Bernoulli random variable X is E[X] = p, and its variance is Var[X] = p(1 − p).

The above can be derived from the Bernoulli distribution as a special case of the binomial distribution.[1] The kurtosis goes to infinity for high and low values of p, but for p = 1/2 the Bernoulli distribution has a lower excess kurtosis than any other probability distribution, namely −2. The Bernoulli distribution is a member of the exponential family. The maximum likelihood estimator of p based on a random sample is the sample mean.
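A minimal sketch of the last point: simulate Bernoulli draws and estimate p by the sample mean (the value 0.3 for p is just an illustrative choice).

    import random

    def bernoulli(p):
        # One Bernoulli trial: 1 with probability p, 0 otherwise.
        return 1 if random.random() < p else 0

    p_true = 0.3
    sample = [bernoulli(p_true) for _ in range(10000)]
    p_hat = sum(sample) / len(sample)   # maximum likelihood estimate of p
    print(p_hat)                        # close to 0.3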

Related distributions
If X1, ..., Xn are independent, identically distributed (i.i.d.) random variables, all Bernoulli distributed with success probability p, then their sum X1 + ... + Xn follows a binomial distribution B(n, p). The Bernoulli distribution is simply B(1, p).

The categorical distribution is the generalization of the Bernoulli distribution for variables with any constant number of discrete values.

The Beta distribution is the conjugate prior of the Bernoulli distribution.

The geometric distribution is the distribution of the number of Bernoulli trials needed to get one success.

Notes
[1] McCullagh and Nelder (1989), Section 4.2.2.

References
McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN 0-412-31760-5.
Johnson, N.L., Kotz, S., Kemp, A. (1993). Univariate Discrete Distributions (2nd Edition). Wiley. ISBN 0-471-54897-9.

External links
Hazewinkel, Michiel, ed. (2001), "Binomial distribution" (http://www.encyclopediaofmath.org/index. php?title=p/b016420), Encyclopedia of Mathematics, Springer, ISBN978-1-55608-010-4 Weisstein, Eric W., " Bernoulli Distribution (http://mathworld.wolfram.com/BernoulliDistribution.html)" from MathWorld.

Binomial distribution

Binomial

Probability mass function and cumulative distribution function (plots).

Notation: B(n, p)
Parameters: n ∈ N0 (number of trials); p ∈ [0, 1] (success probability in each trial)
Support: k ∈ {0, ..., n} (number of successes)
PMF: C(n, k) p^k (1 − p)^{n−k}
CDF: I_{1−p}(n − k, k + 1), the regularized incomplete beta function
Mean: np
Median: ⌊np⌋ or ⌈np⌉
Mode: ⌊(n + 1)p⌋ or ⌈(n + 1)p⌉ − 1
Variance: np(1 − p)
Skewness: (1 − 2p)/√(np(1 − p))
Ex. kurtosis: (1 − 6p(1 − p))/(np(1 − p))
Entropy: (1/2) ln(2πe np(1 − p)) + O(1/n)
MGF: (1 − p + p e^t)^n
CF: (1 − p + p e^{it})^n
PGF: (1 − p + p z)^n
Fisher information: n/(p(1 − p))

In probability theory and statistics, the binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when n = 1, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the popular binomial test of statistical significance.

The binomial distribution is frequently used to model the number of successes in a sample of size n drawn with replacement from a population of size N. If the sampling is carried out without replacement, the draws are not independent and so the resulting distribution is a hypergeometric distribution, not a binomial one. However, for N much larger than n, the binomial distribution is a good approximation, and widely used.


Specification
Probability mass function
In general, if the random variable K follows the binomial distribution with parameters n and p, we write K ~ B(n, p). The probability of getting exactly k successes in n trials is given by the probability mass function:

    f(k; n, p) = Pr(K = k) = C(n, k) p^k (1 − p)^{n−k}

Figure caption: the probability that a ball in a Galton box with 8 layers ends up in the central bin, with binomial probabilities as in Pascal's triangle.

for k = 0, 1, 2, ..., n, where

    C(n, k) = n! / (k! (n − k)!)

is the binomial coefficient (hence the name of the distribution), "n choose k", also denoted nCk. The formula can be understood as follows: we want k successes (probability p^k) and n − k failures (probability (1 − p)^{n−k}). However, the k successes can occur anywhere among the n trials, and there are C(n, k) different ways of distributing k successes in a sequence of n trials.

In creating reference tables for binomial distribution probability, usually the table is filled in up to n/2 values. This is because for k > n/2, the probability can be calculated by its complement as

    f(k; n, p) = f(n − k; n, 1 − p).
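A minimal sketch of the pmf and of the complement identity (n = 10 and p = 0.3 are illustrative choices; math.comb requires Python 3.8+):

    import math

    def binom_pmf(k, n, p):
        # Binomial probability mass function: C(n, k) p^k (1-p)^(n-k).
        return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

    # Complement identity: f(k; n, p) == f(n - k; n, 1 - p).
    print(binom_pmf(7, 10, 0.3))
    print(binom_pmf(3, 10, 0.7))
    # Probabilities over the whole support sum to 1.
    print(sum(binom_pmf(k, 10, 0.3) for k in range(11)))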


Looking at the expression f(k, n, p) as a function of k, there is a k value that maximizes it. This k value can be found by calculating

    f(k + 1, n, p) / f(k, n, p) = (n − k) p / ((k + 1)(1 − p))

and comparing it to 1. There is always an integer M that satisfies

    (n + 1)p − 1 ≤ M < (n + 1)p.

f(k, n, p) is monotone increasing for k < M and monotone decreasing for k > M, with the exception of the case where (n + 1)p is an integer. In this case, there are two values for which f is maximal: (n + 1)p and (n + 1)p − 1. M is the most probable (most likely) outcome of the Bernoulli trials and is called the mode. Note that the probability of it occurring can be fairly small.

Cumulative distribution function


The cumulative distribution function can be expressed as:

    F(k; n, p) = Pr(X ≤ k) = Σ_{i=0}^{⌊k⌋} C(n, i) p^i (1 − p)^{n−i},

where ⌊k⌋ is the "floor" under k, i.e. the greatest integer less than or equal to k.

It can also be represented in terms of the regularized incomplete beta function, as follows:

    F(k; n, p) = I_{1−p}(n − k, k + 1).
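A small sketch checking the two expressions against each other, assuming SciPy is available (scipy.special.betainc computes the regularized incomplete beta function I_x(a, b)); the values n = 10, p = 0.3, k = 4 are illustrative:

    import math
    from scipy.special import betainc  # regularized incomplete beta I_x(a, b)

    def binom_cdf(k, n, p):
        # Direct summation of the pmf.
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    n, p, k = 10, 0.3, 4
    print(binom_cdf(k, n, p))            # direct sum
    print(betainc(n - k, k + 1, 1 - p))  # same value via I_{1-p}(n-k, k+1)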

For k ≤ np, upper bounds for the lower tail of the distribution function can be derived. In particular, Hoeffding's inequality yields the bound

    F(k; n, p) ≤ exp(−2n (p − k/n)²),

and Chernoff's inequality can be used to derive a similar bound. Moreover, these bounds are reasonably tight when p = 1/2, since for that case a lower bound of the same exponential form holds for all k ≥ 3n/8.[1]


Mean and variance


If X ~ B(n, p) (that is, X is a binomially distributed random variable), then the expected value of X is

    E[X] = np,

and the variance is

    Var[X] = np(1 − p).

Mode and median


Usually the mode of a binomial B(n, p) distribution is equal to ⌊(n + 1)p⌋, where ⌊·⌋ is the floor function. However, when (n + 1)p is an integer and p is neither 0 nor 1, then the distribution has two modes: (n + 1)p and (n + 1)p − 1. When p is equal to 0 or 1, the mode is 0 or n, respectively. These cases can be summarized as follows: the mode is ⌊(n + 1)p⌋ if (n + 1)p is 0 or a non-integer; both (n + 1)p and (n + 1)p − 1 if (n + 1)p ∈ {1, ..., n}; and n if (n + 1)p = n + 1.

In general, there is no single formula to find the median for a binomial distribution, and it may even be non-unique. However, several special results have been established:

If np is an integer, then the mean, median, and mode coincide and equal np.[2][3]
Any median m must lie within the interval ⌊np⌋ ≤ m ≤ ⌈np⌉.[4]
A median m cannot lie too far away from the mean: |m − np| ≤ min{ln 2, max{p, 1 − p}}.[5]
The median is unique and equal to m = round(np) in cases when either p ≤ 1 − ln 2 or p ≥ ln 2 or |m − np| ≤ min{p, 1 − p} (except for the case when p = 1/2 and n is odd).[4][5]
When p = 1/2 and n is odd, any number m in the interval ½(n − 1) ≤ m ≤ ½(n + 1) is a median of the binomial distribution. If p = 1/2 and n is even, then m = n/2 is the unique median.

Covariance between two binomials


If two binomially distributed random variables X and Y are observed together, estimating their covariance can be useful. Using the definition of covariance, in the case n = 1 (thus being Bernoulli trials) we have

    Cov(X, Y) = E(XY) − μ_X μ_Y.

The first term is non-zero only when both X and Y are one, and μ_X and μ_Y are equal to the two success probabilities. Defining p_B as the probability of both happening at the same time, this gives

    Cov(X, Y) = p_B − p_X p_Y,

and for n such trials, again due to independence,

    Cov(X, Y) = n (p_B − p_X p_Y).

If X and Y are the same variable, this reduces to the variance formula given above.


Relationship to other distributions


Sums of binomials
If X ~ B(n, p) and Y ~ B(m, p) are independent binomial variables with the same probability p, then X + Y is again a binomial variable; its distribution is

    X + Y ~ B(n + m, p).

Conditional binomials
If X ~ B(n, p) and, conditional on X, Y ~ B(X, q), then Y is a simple binomial variable with distribution

    Y ~ B(n, pq).

Bernoulli distribution
The Bernoulli distribution is a special case of the binomial distribution, where n=1. Symbolically, X~B(1,p) has the same meaning as X~Bern(p). Conversely, any binomial distribution, B(n,p), is the sum of n independent Bernoulli trials, Bern(p), each with the same probability p.

Poisson binomial distribution


The binomial distribution is a special case of the Poisson binomial distribution, which is a sum of n independent non-identical Bernoulli trials Bern(p_i). If X has the Poisson binomial distribution with p_1 = ... = p_n = p, then X ~ B(n, p).

Normal approximation
If n is large enough, then the skew of the distribution is not too great. In this case a reasonable approximation to B(n, p) is given by the normal distribution

    N(np, np(1 − p)),

and this basic approximation can be improved in a simple way by using a suitable continuity correction. The basic approximation generally improves as n increases (at least 20) and is better when p is not near to 0 or 1.[6]

Figure: binomial PMF and normal approximation for n = 6 and p = 0.5.

Various rules of thumb may be used to decide whether n is large enough, and p is far enough from the extremes of zero or one:

One rule is that both np and n(1 − p) must be greater than 5. However, the specific number varies from source to source, and depends on how good an approximation one wants; some sources give 10, which gives virtually the same results as the following rule for large n until n is very large (e.g. x = 11, n = 7752).
A second rule[6] is that for n > 5 the normal approximation is adequate if

    |(1/√n) (√((1 − p)/p) − √(p/(1 − p)))| < 0.3.

Another commonly used rule holds that the normal approximation is appropriate only if everything within 3 standard deviations of its mean is within the range of possible values, that is, if

    np ± 3√(np(1 − p)) ∈ (0, n).


The following is an example of applying a continuity correction. Suppose one wishes to calculate Pr(X ≤ 8) for a binomial random variable X. If Y has a distribution given by the normal approximation, then Pr(X ≤ 8) is approximated by Pr(Y ≤ 8.5). The addition of 0.5 is the continuity correction; the uncorrected normal approximation gives considerably less accurate results.

This approximation, known as the de Moivre-Laplace theorem, is a huge time-saver when undertaking calculations by hand (exact calculations with large n are very onerous); historically, it was the first use of the normal distribution, introduced in Abraham de Moivre's book The Doctrine of Chances in 1738. Nowadays, it can be seen as a consequence of the central limit theorem, since B(n, p) is a sum of n independent, identically distributed Bernoulli variables with parameter p. This fact is the basis of a hypothesis test, a "proportion z-test," for the value of p using x/n, the sample proportion and estimator of p, in a common test statistic.[7]

For example, suppose one randomly samples n people out of a large population and asks them whether they agree with a certain statement. The proportion of people who agree will of course depend on the sample. If groups of n people were sampled repeatedly and truly randomly, the proportions would follow an approximate normal distribution with mean equal to the true proportion p of agreement in the population and with standard deviation σ = (p(1 − p)/n)^{1/2}. Large sample sizes n are good because the standard deviation, as a proportion of the expected value, gets smaller, which allows a more precise estimate of the unknown parameter p.
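A minimal numerical sketch of the continuity correction (n = 20 and p = 0.5 are illustrative choices):

    import math

    def binom_cdf(k, n, p):
        return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

    def normal_cdf(x, mu, sigma):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    n, p, k = 20, 0.5, 8
    mu, sigma = n * p, math.sqrt(n * p * (1 - p))
    print(binom_cdf(k, n, p))              # exact Pr(X <= 8), about 0.2517
    print(normal_cdf(k + 0.5, mu, sigma))  # with continuity correction, about 0.2512
    print(normal_cdf(k, mu, sigma))        # without correction, about 0.1855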

Poisson approximation
The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed. Therefore the Poisson distribution with parameter λ = np can be used as an approximation to B(n, p) if n is sufficiently large and p is sufficiently small. According to two rules of thumb, this approximation is good if n ≥ 20 and p ≤ 0.05, or if n ≥ 100 and np ≤ 10.[8]
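A small sketch comparing the two pmfs under the first rule of thumb (n = 100, p = 0.05, so λ = 5):

    import math

    def binom_pmf(k, n, p):
        return math.comb(n, k) * p**k * (1 - p)**(n - k)

    def poisson_pmf(k, lam):
        return lam**k * math.exp(-lam) / math.factorial(k)

    n, p = 100, 0.05
    lam = n * p
    for k in range(0, 11, 2):
        # The two columns of probabilities agree to within a few percent.
        print(k, round(binom_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))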

Limiting distributions
Poisson limit theorem: as n approaches ∞ and p approaches 0 while np remains fixed at λ > 0, or at least np approaches λ > 0, the Binomial(n, p) distribution approaches the Poisson distribution with expected value λ.

de Moivre-Laplace theorem: as n approaches ∞ while p remains fixed, the distribution of

    (X − np) / √(np(1 − p))

approaches the normal distribution with expected value 0 and variance 1. This result is sometimes loosely stated by saying that the distribution of X is asymptotically normal with expected value np and variance np(1 − p). This result is a specific case of the central limit theorem.

Confidence intervals
Even for quite large values of n, the actual distribution of the mean is significantly non-normal.[9] Because of this problem several methods to estimate confidence intervals have been proposed.

Let n1 be the number of successes out of n, the total number of trials, and let

    p̂ = n1 / n

be the proportion of successes. Let z_{α/2} be the 100(1 − α/2)th percentile of the standard normal distribution.

Wald method:

    p̂ ± z_{α/2} √( p̂ (1 − p̂) / n )

A continuity correction of 0.5/n may be added.

Agresti-Coull method:[10]

    p̃ ± z_{α/2} √( p̃ (1 − p̃) / ñ ),   where ñ = n + z_{α/2}².


Here the estimate of p is modified to

    p̃ = (n1 + z_{α/2}²/2) / ñ.

ArcSine method:[11]

    sin²( arcsin(√p̂) ± z_{α/2} / (2√n) )

Wilson (score) method:[12]

    ( p̂ + z_{α/2}²/(2n) ± z_{α/2} √( p̂(1 − p̂)/n + z_{α/2}²/(4n²) ) ) / ( 1 + z_{α/2}²/n )

The exact (Clopper-Pearson) method is the most conservative.[9] The Wald method, although commonly recommended in textbooks, is the most biased.
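A sketch of two of the intervals above in their standard closed forms, at the 95% level (z = 1.96); the sample counts are illustrative choices:

    import math

    def wald_interval(n1, n, z=1.96):
        p_hat = n1 / n
        half = z * math.sqrt(p_hat * (1 - p_hat) / n)
        return p_hat - half, p_hat + half

    def wilson_interval(n1, n, z=1.96):
        p_hat = n1 / n
        center = p_hat + z * z / (2 * n)
        half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
        denom = 1 + z * z / n
        return (center - half) / denom, (center + half) / denom

    print(wald_interval(3, 20))     # can misbehave for small n or extreme proportions
    print(wilson_interval(3, 20))   # better behaved near 0 and 1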

Generating binomial random variates


Methods for random number generation where the marginal distribution is a binomial distribution are well-established.[13][14]

References
[1] Matoušek, J., Vondrák, J.: The Probabilistic Method (lecture notes) (http://kam.mff.cuni.cz/~matousek/prob-ln.ps.gz).
[2] Neumann, P. (1966). "Über den Median der Binomial- und Poissonverteilung" (in German). Wissenschaftliche Zeitschrift der Technischen Universität Dresden 19: 29-33.
[3] Lord, Nick (July 2010). "Binomial averages when the mean is an integer", The Mathematical Gazette 94: 331-332.
[4] Kaas, R.; Buhrman, J.M. (1980). "Mean, Median and Mode in Binomial Distributions". Statistica Neerlandica 34 (1): 13-18. doi:10.1111/j.1467-9574.1980.tb00681.x.
[5] Hamza, K. (1995). "The smallest uniform upper bound on the distance between the mean and the median of the binomial and Poisson distributions". Statistics & Probability Letters 23: 21-25. doi:10.1016/0167-7152(94)00090-U.
[6] Box, Hunter and Hunter (1978). Statistics for Experimenters. Wiley. p. 130.
[7] NIST/SEMATECH, "7.2.4. Does the proportion of defectives meet requirements?" (http://www.itl.nist.gov/div898/handbook/prc/section2/prc24.htm), e-Handbook of Statistical Methods.
[8] NIST/SEMATECH, "6.3.3.1. Counts Control Charts" (http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc331.htm), e-Handbook of Statistical Methods.
[9] Brown LD, Cai T. and DasGupta A (2001). "Interval estimation for a binomial proportion (with discussion)". Statist Sci 16: 101-133.
[10] Agresti A, Coull BA (1998). "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician 52: 119-126.
[11] Pires MA. "Confidence intervals for a binomial proportion: comparison of methods and software evaluation."
[12] Wilson EB (1927). "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association 22: 209-212.
[13] Devroye, Luc (1986). Non-Uniform Random Variate Generation. New York: Springer-Verlag. (See especially Chapter X, Discrete Univariate Distributions (http://cg.scs.carleton.ca/~luc/chapter_ten.pdf).)
[14] Kachitvichyanukul, V.; Schmeiser, B. W. (1988). "Binomial random variate generation". Communications of the ACM 31 (2): 216-222. doi:10.1145/42372.42381.

Uniform distribution (discrete)


discrete uniform

Probability mass function and cumulative distribution function (plotted for n = 5, where n = b − a + 1).

Parameters: integers a, b with b ≥ a; n = b − a + 1
Support: k ∈ {a, a + 1, ..., b}
PMF: 1/n
CDF: (⌊k⌋ − a + 1)/n
Mean: (a + b)/2
Median: (a + b)/2
Mode: N/A
Variance: (n² − 1)/12
Skewness: 0
Ex. kurtosis: −6(n² + 1)/(5(n² − 1))
Entropy: ln(n)
MGF: (e^{at} − e^{(b+1)t}) / (n(1 − e^{t}))
CF: (e^{iat} − e^{i(b+1)t}) / (n(1 − e^{it}))[1]

In probability theory and statistics, the discrete uniform distribution is a probability distribution whereby a finite number of equally spaced values are equally likely to be observed; every one of n values has equal probability 1/n. Another way of saying "discrete uniform distribution" would be "a known, finite number of equally spaced outcomes equally likely to happen."

If a random variable has any of n possible values that are equally spaced and equally probable, then it has a discrete uniform distribution. The probability of any outcome is 1/n. A simple example of the discrete uniform distribution is throwing a fair die. The possible values are 1, 2, 3, 4, 5, 6; and each time the die is thrown, the probability of a given score is 1/6. If two dice are thrown and their values added, the uniform distribution no longer fits, since the values from 2 to 12 do not have equal probabilities.

The cumulative distribution function (CDF) of the discrete uniform distribution can be expressed in terms of a degenerate distribution as

    F(k; a, b) = (1/n) Σ_{i=a}^{b} H(k − i),

where the Heaviside step function H(k − k0) is the CDF of the degenerate distribution centered at k0, using the convention that H(0) = 1.

Estimation of maximum
This example is described by saying that a sample of k observations is obtained from a uniform distribution on the integers 1, 2, ..., N, with the problem being to estimate the unknown maximum N. This problem is commonly known as the German tank problem, following the application of maximum estimation to estimates of German tank production during World War II.

The UMVU estimator for the maximum is given by

    N̂ = m + m/k − 1 = (k + 1)m/k − 1,

where m is the sample maximum and k is the sample size, sampling without replacement.[2][3] This can be seen as a very simple case of maximum spacing estimation. The formula may be understood intuitively as "the sample maximum plus the average gap between observations in the sample", the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population maximum.[4]

This has a variance of[2]

    (1/k) (N − k)(N + 1) / (k + 2) ≈ N²/k²   for small samples k ≪ N,

so a standard deviation of approximately N/k, the (population) average size of a gap between samples; compare m/k above.

The sample maximum is the maximum likelihood estimator for the population maximum, but, as discussed above, it is biased. If samples are not numbered but are recognizable or markable, one can instead estimate population size via the capture-recapture method.
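A small simulation sketch of the UMVU estimator above (the values N = 250 and k = 6 are illustrative choices):

    import random

    def umvu_max_estimate(sample):
        # Sample maximum plus the average gap between observations.
        m, k = max(sample), len(sample)
        return m + m / k - 1

    N, k = 250, 6
    estimates = []
    for _ in range(10000):
        sample = random.sample(range(1, N + 1), k)   # sampling without replacement
        estimates.append(umvu_max_estimate(sample))
    print(sum(estimates) / len(estimates))   # close to N = 250, since the estimator is unbiased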


Random permutation
See rencontres numbers for an account of the probability distribution of the number of fixed points of a uniformly distributed random permutation.

Notes
[1] http://adorio-research.org/wordpress/?p=519
[2] Johnson, Roger (1994), "Estimating the Size of a Population", Teaching Statistics (http://www.rsscse.org.uk/ts/index.htm) 16 (2, Summer), doi:10.1111/j.1467-9639.1994.tb00688.x.
[3] Johnson, Roger (2006), "Estimating the Size of a Population" (http://www.rsscse.org.uk/ts/gtb/johnson.pdf), Getting the Best from Teaching Statistics (http://www.rsscse.org.uk/ts/gtb/contents.html).
[4] The sample maximum is never more than the population maximum, but can be less; hence it is a biased estimator: it will tend to underestimate the population maximum.



Poisson distribution
Poisson

Probability mass function and cumulative distribution function: the horizontal axis is the index k, the number of occurrences. The pmf is defined only at integer values of k (the connecting lines are only guides for the eye); the CDF is discontinuous at the integers of k and flat everywhere else, because a variable that is Poisson distributed takes on only integer values.

Notation: Pois(λ)
Parameters: λ > 0 (real)
Support: k ∈ {0, 1, 2, 3, ...}
PMF: λ^k e^{−λ} / k!
CDF: Γ(⌊k + 1⌋, λ) / ⌊k⌋!  — or —  e^{−λ} Σ_{i=0}^{⌊k⌋} λ^i / i!  (for k ≥ 0, where Γ(x, y) is the incomplete gamma function and ⌊k⌋ is the floor function)
Mean: λ
Median: ≈ ⌊λ + 1/3 − 0.02/λ⌋
Mode: ⌈λ⌉ − 1, ⌊λ⌋
Variance: λ
Skewness: λ^{−1/2}
Ex. kurtosis: λ^{−1}
Entropy: (1/2) ln(2πeλ) − 1/(12λ) − ... (for large λ)
MGF: exp(λ(e^t − 1))
CF: exp(λ(e^{it} − 1))
PGF: exp(λ(z − 1))

In probability theory and statistics, the Poisson distribution (pronounced [pwasɔ̃]) is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.[1] (The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.)

Suppose someone typically gets on the average 4 pieces of mail per day. There will however be a certain spread: sometimes a little more, sometimes a little less, once in a while nothing at all.[2] Given only the average rate, for a certain period of observation (pieces of mail per day, phone calls per hour, etc.), and assuming that the process, or mix of processes, that produces the event flow is essentially random, the Poisson distribution specifies how likely it is that the count will be 3, or 5, or 11, or any other number, during one period of observation. That is, it predicts the degree of spread around a known average rate of occurrence.[2]

The distribution's practical usefulness has been explained by the Poisson law of small numbers.[3]

History
The distribution was first introduced by Siméon Denis Poisson (1781-1840) and published, together with his probability theory, in 1837 in his work Recherches sur la probabilité des jugements en matière criminelle et en matière civile ("Research on the Probability of Judgments in Criminal and Civil Matters").[4] The work focused on certain random variables N that count, among other things, the number of discrete occurrences (sometimes called "arrivals") that take place during a time-interval of given length. A practical application of this distribution was made by Ladislaus Bortkiewicz in 1898 when he was given the task of investigating the number of soldiers in the Prussian army killed accidentally by horse kick; this experiment introduced the Poisson distribution to the field of reliability engineering.[5]

Definition
A discrete stochastic variable X is said to have a Poisson distribution with parameter λ > 0, if for k = 0, 1, 2, ... the probability mass function of X is given by:

    f(k; λ) = Pr(X = k) = λ^k e^{−λ} / k!,

where

e is the base of the natural logarithm (e = 2.71828...)
k! is the factorial of k.

The positive real number λ is equal to the expected value of X, but also to its variance:

    λ = E(X) = Var(X).

The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare. The Poisson distribution is sometimes called a Poissonian.
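A minimal sketch of the pmf; λ = 4, matching the mail example above, is an illustrative choice.

    import math

    def poisson_pmf(k, lam):
        # Pr(X = k) for X ~ Poisson(lam).
        return lam**k * math.exp(-lam) / math.factorial(k)

    lam = 4.0
    for k in range(9):
        print(k, round(poisson_pmf(k, lam), 4))
    print(sum(poisson_pmf(k, lam) for k in range(100)))   # essentially 1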


Properties
Mean
The expected value of a Poisson-distributed random variable is equal to λ and so is its variance. The coefficient of variation is λ^{−1/2}, while the index of dispersion is 1.[6] The mean deviation about the mean is[6]

    E|X − λ| = 2 λ^{⌊λ⌋+1} e^{−λ} / ⌊λ⌋!.

The mode of a Poisson-distributed random variable with non-integer λ is equal to ⌊λ⌋, which is the largest integer less than or equal to λ. This is also written as floor(λ). When λ is a positive integer, the modes are λ and λ − 1.

All of the cumulants of the Poisson distribution are equal to the expected value λ. The nth factorial moment of the Poisson distribution is λ^n.

Median
Bounds for the median (ν) of the distribution are known and are sharp:[7]

    λ − ln 2 ≤ ν < λ + 1/3.

Higher moments
The higher moments m_k of the Poisson distribution about the origin are Touchard polynomials in λ:

    m_k = Σ_{i=0}^{k} S(k, i) λ^i,

where the S(k, i) are Stirling numbers of the second kind.[8] The coefficients of the polynomials have a combinatorial meaning. In fact, when the expected value of the Poisson distribution is 1, then Dobinski's formula says that the nth moment equals the number of partitions of a set of size n.

Sums of Poisson-distributed random variables: if X_i ~ Pois(λ_i), i = 1, ..., n, are independent, then their sum X_1 + ... + X_n ~ Pois(λ_1 + ... + λ_n).[9]

A converse is Raikov's theorem, which says that if the sum of two independent random variables is Poisson-distributed, then so is each of those two independent random variables.[10]


Other properties
The Poisson distributions are infinitely divisible probability distributions.[11][12]

The directed Kullback-Leibler divergence between Pois(λ) and Pois(λ0) is given by

    D_KL(λ ‖ λ0) = λ0 − λ + λ ln(λ/λ0).

Bounds for the tail probabilities of a Poisson random variable can be derived using a Chernoff bound argument.[13]

Related distributions
If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the difference Y = X1 − X2 follows a Skellam distribution.

If X1 ~ Pois(λ1) and X2 ~ Pois(λ2) are independent, then the distribution of X1 conditional on X1 + X2 = k is a binomial distribution. Specifically, X1 | (X1 + X2 = k) ~ Binomial(k, λ1/(λ1 + λ2)). More generally, if X1, X2, ..., Xn are independent Poisson random variables with parameters λ1, λ2, ..., λn, then, given Σ_j Xj = k, each Xi follows Binomial(k, λi / Σ_j λj). In fact, the vector (X1, ..., Xn) conditional on Σ_j Xj = k follows a multinomial distribution with k trials and probabilities λi / Σ_j λj.

The Poisson distribution can be derived as a limiting case of the binomial distribution as the number of trials goes to infinity and the expected number of successes remains fixed (see law of rare events below). Therefore it can be used as an approximation of the binomial distribution if n is sufficiently large and p is sufficiently small. There is a rule of thumb stating that the Poisson distribution is a good approximation of the binomial distribution if n is at least 20 and p is smaller than or equal to 0.05, and an excellent approximation if n ≥ 100 and np ≤ 10.[14]

The Poisson distribution is a special case of the generalized stuttering Poisson distribution (or stuttering Poisson distribution) with only a single parameter.[15] The stuttering Poisson distribution can be deduced from the limiting distribution of the multinomial distribution.

For sufficiently large values of λ (say λ > 1000), the normal distribution with mean λ and variance λ (standard deviation √λ) is an excellent approximation to the Poisson distribution. If λ is greater than about 10, then the normal distribution is a good approximation if an appropriate continuity correction is performed, i.e., P(X ≤ x), where (lower-case) x is a non-negative integer, is replaced by P(X ≤ x + 0.5).

Variance-stabilizing transformation: when a variable is Poisson distributed, its square root is approximately normally distributed with expected value of about √λ and variance of about 1/4.[16][17] Under this transformation, the convergence to normality (as λ increases) is far faster than for the untransformed variable. Other, slightly more complicated, variance-stabilizing transformations are available,[17] one of which is the Anscombe transform. See Data transformation (statistics) for more general uses of transformations.

If for every t > 0 the number of arrivals in the time interval [0, t] follows the Poisson distribution with mean λt, then the sequence of inter-arrival times are independent and identically distributed exponential random variables having mean 1/λ.[18]

The cumulative distribution functions of the Poisson and chi-squared distributions are related in the following way:[19][20] for integer k,

    Pr(X ≤ k) = Pr(χ²(2(k + 1)) > 2λ).


Occurrence
Applications of the Poisson distribution can be found in many fields related to counting:

Electrical system example: telephone calls arriving in a system.
Astronomy example: photons arriving at a telescope.
Biology example: the number of mutations on a strand of DNA per unit time.
Management example: customers arriving at a counter or call centre.
Civil engineering example: cars arriving at a traffic light.
Finance and insurance example: number of losses/claims occurring in a given period of time.

Earthquake seismology example: an asymptotic Poisson model of seismic risk for large earthquakes (Lomnitz, 1994).

The Poisson distribution arises in connection with Poisson processes. It applies to various phenomena of discrete properties (that is, those that may happen 0, 1, 2, 3, ... times during a given period of time or in a given area) whenever the probability of the phenomenon happening is constant in time or space. Examples of events that may be modelled as a Poisson distribution include:

The number of soldiers killed by horse-kicks each year in each corps in the Prussian cavalry. This example was made famous by a book of Ladislaus Josephovich Bortkiewicz (1868-1931).
The number of yeast cells used when brewing Guinness beer. This example was made famous by William Sealy Gosset (1876-1937).[21]
The number of phone calls arriving at a call centre per minute.
The number of goals in sports involving two competing teams.
The number of deaths per year in a given age group.
The number of jumps in a stock price in a given time interval.
Under an assumption of homogeneity, the number of times a web server is accessed per minute.
The number of mutations in a given stretch of DNA after a certain amount of radiation.
The proportion of cells that will be infected at a given multiplicity of infection.


How does this distribution arise? The law of rare events


In several of the above examples, such as the number of mutations in a given sequence of DNA, the events being counted are actually the outcomes of discrete trials, and would more precisely be modelled using the binomial distribution, that is

    X ~ B(n, p).

In such cases n is very large and p is very small (and so the expectation np is of intermediate magnitude). Then the distribution may be approximated by the less cumbersome Poisson distribution

    X ~ Pois(np).

This is sometimes known as the law of rare events, since each of the n individual Bernoulli events rarely occurs. The name may be misleading because the total count of success events in a Poisson process need not be rare if the parameter np is not small. For example, the number of telephone calls to a busy switchboard in one hour follows a Poisson distribution with the events appearing frequent to the operator, but they are rare from the point of view of the average member of the population, who is very unlikely to make a call to that switchboard in that hour.

Figure: comparison of the Poisson distribution (black lines) and the binomial distribution with n=10 (red circles), n=20 (blue circles), n=1000 (green circles). All distributions have a mean of 5; the horizontal axis shows the number of events k. As n gets larger, the Poisson distribution becomes an increasingly better approximation for the binomial distribution with the same mean.

The word law is sometimes used as a synonym of probability distribution, and convergence in law means convergence in distribution. Accordingly, the Poisson distribution is sometimes called the law of small numbers because it is the probability distribution of the number of occurrences of an event that happens rarely but has very many opportunities to happen. The Law of Small Numbers is a book by Ladislaus Bortkiewicz about the Poisson distribution, published in 1898. Some have suggested that the Poisson distribution should have been called the Bortkiewicz distribution.[22]

Multi-dimensional Poisson process


The Poisson distribution arises as the distribution of counts of occurrences of events in (multidimensional) intervals in multidimensional Poisson processes in a directly equivalent way to the result for unidimensional processes. Thus, if D is any region of the multidimensional space for which |D|, the area or volume of the region, is finite, and if N(D) is the count of the number of events in D, then

    P(N(D) = k) = (λ|D|)^k e^{−λ|D|} / k!.


Other applications in science


In a Poisson process, the number of observed occurrences fluctuates about its mean λ with a standard deviation σ = √λ. These fluctuations are denoted as Poisson noise or (particularly in electronics) as shot noise.

The correlation of the mean and standard deviation in counting independent discrete occurrences is useful scientifically. By monitoring how the fluctuations vary with the mean signal, one can estimate the contribution of a single occurrence, even if that contribution is too small to be detected directly. For example, the charge e on an electron can be estimated by correlating the magnitude of an electric current with its shot noise. If N electrons pass a point in a given time t on the average, the mean current is I = eN/t; since the current fluctuations should be of the order σ_I = e√N/t (i.e., the standard deviation of the Poisson process), the charge e can be estimated from the ratio t σ_I² / I.

An everyday example is the graininess that appears as photographs are enlarged; the graininess is due to Poisson fluctuations in the number of reduced silver grains, not to the individual grains themselves. By correlating the graininess with the degree of enlargement, one can estimate the contribution of an individual grain (which is otherwise too small to be seen unaided). Many other molecular applications of Poisson noise have been developed, e.g., estimating the number density of receptor molecules in a cell membrane.

Generating Poisson-distributed random variables


A simple algorithm to generate random Poisson-distributed numbers (pseudo-random number sampling) has been given by Knuth (see References below):

    algorithm poisson random number (Knuth):
        init:
            Let L ← e^{−λ}, k ← 0 and p ← 1.
        do:
            k ← k + 1.
            Generate uniform random number u in [0,1] and let p ← p × u.
        while p > L.
        return k − 1.

While simple, the complexity is linear in λ. There are many other algorithms to overcome this. Some are given in Ahrens & Dieter, see References below. Also, for large values of λ, there may be numerical stability issues because of the term e^{−λ}. One solution for large values of λ is rejection sampling; another is to use a Gaussian approximation to the Poisson.

Inverse transform sampling is simple and efficient for small values of λ, and requires only one uniform random number u per sample. Cumulative probabilities are examined in turn until one exceeds u.
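A direct Python transcription of Knuth's algorithm, as a sketch (fine for small λ):

    import math
    import random

    def poisson_knuth(lam):
        # Multiply uniform random numbers until the product drops below e^{-lam}.
        L = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            k += 1
            p *= random.random()
            if p <= L:
                return k - 1

    draws = [poisson_knuth(4.0) for _ in range(100000)]
    print(sum(draws) / len(draws))   # close to lam = 4.0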


Parameter estimation
Maximum likelihood
Given a sample of n measured values k_i, we wish to estimate the value of the parameter λ of the Poisson population from which the sample was drawn. The maximum likelihood estimate is

    λ̂_MLE = (1/n) Σ_{i=1}^{n} k_i.

Since each observation has expectation λ, so does this sample mean. Therefore the maximum likelihood estimate is an unbiased estimator of λ. It is also an efficient estimator, i.e. its estimation variance achieves the Cramér-Rao lower bound (CRLB). Hence it is MVUE. Also it can be proved that the sample mean is a complete and sufficient statistic for λ.

Confidence interval
The confidence interval for a Poisson mean is calculated using the relationship between the Poisson and chi-square distributions, and can be written as:

    (1/2) χ²(α/2; 2k) ≤ μ ≤ (1/2) χ²(1 − α/2; 2k + 2),

where k is the number of event occurrences in a given interval and χ²(p; n) is the chi-square deviate with lower tail area p and degrees of freedom n.[19][23] This interval is 'exact' in the sense that its coverage probability is never less than the nominal 1 − α.

When quantiles of the chi-square distribution are not available, an accurate approximation to this exact interval was proposed by DP Byar (based on the Wilson-Hilferty transformation):[24]

    k (1 − 1/(9k) − z_{α/2}/(3√k))³ ≤ μ ≤ (k + 1) (1 − 1/(9(k + 1)) + z_{α/2}/(3√(k + 1)))³,

where z_{α/2} denotes the standard normal deviate with upper tail area α/2.

For application of these formulae in the same context as above (given a sample of n measured values k_i), one would set

    k = Σ_{i=1}^{n} k_i,

calculate an interval for μ = nλ, and then derive the interval for λ.

Bayesian inference
In Bayesian inference, the conjugate prior for the rate parameter λ of the Poisson distribution is the gamma distribution. Let

    λ ~ Gamma(α, β)

denote that λ is distributed according to the gamma density g parameterized in terms of a shape parameter α and an inverse scale parameter β:

    g(λ; α, β) = (β^α / Γ(α)) λ^{α−1} e^{−βλ}   for λ > 0.

Then, given the same sample of n measured values k_i as before, and a prior of Gamma(α, β), the posterior distribution is

    λ ~ Gamma(α + Σ_{i=1}^{n} k_i, β + n).

The posterior mean E[λ] approaches the maximum likelihood estimate in the limit as α → 0, β → 0.

The posterior predictive distribution for a single additional observation is a negative binomial distribution, sometimes called a gamma-Poisson distribution.
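A minimal sketch of the conjugate update above; the prior hyperparameters (α = 2, β = 1) and the observed counts are illustrative choices.

    # Gamma(alpha, beta) prior on the Poisson rate, with beta an inverse scale (rate).
    alpha_prior, beta_prior = 2.0, 1.0

    data = [3, 5, 4, 6, 2, 4, 5, 3]      # hypothetical observed counts

    # Conjugate update: posterior is Gamma(alpha + sum(k_i), beta + n).
    alpha_post = alpha_prior + sum(data)
    beta_post = beta_prior + len(data)

    print(alpha_post / beta_post)         # posterior mean E[lambda], about 3.78
    print(sum(data) / len(data))          # maximum likelihood estimate, 4.0, for comparison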

Bivariate Poisson distribution


This distribution has been extended to the bivariate case.[25] The generating function for this distribution is

with

The marginal distributions are Poisson(θ1) and Poisson(θ2), and the correlation coefficient is limited to the range

The Skellam distribution is a particular case of this distribution.

Notes
[1] Frank A. Haight (1967). Handbook of the Poisson Distribution. New York: John Wiley & Sons.
[2] "Statistics | The Poisson Distribution" (http://www.umass.edu/wsp/statistics/lessons/poisson/index.html). Umass.edu. 2007-08-24. Retrieved 2012-04-05.
[3] Gullberg, Jan (1997). Mathematics from the birth of numbers. New York: W. W. Norton. pp. 963–965. ISBN 0-393-04002-X.
[4] S.D. Poisson, Probabilité des jugements en matière criminelle et en matière civile, précédées des règles générales du calcul des probabilités (Paris, France: Bachelier, 1837), page 206 (http://books.google.com/books?id=uovoFE3gt2EC&pg=PA206).
[5] Ladislaus von Bortkiewicz, Das Gesetz der kleinen Zahlen [The law of small numbers] (Leipzig, Germany: B.G. Teubner, 1898). On page 1, Bortkiewicz presents the Poisson distribution. On pages 23–25, he presents his famous analysis of "4. Beispiel: Die durch Schlag eines Pferdes im preussischen Heere Getöteten" (4. Example: Those killed in the Prussian army by a horse's kick).
[6] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 157.
[7] Choi KP (1994) On the medians of Gamma distributions and an equation of Ramanujan. Proc Amer Math Soc 121 (1) 245–251.
[8] Riordan, John (1937). "Moment recurrence relations for binomial, Poisson and hypergeometric frequency distributions". Annals of Mathematical Statistics 8: 103–111. Also see Haight (1967), p. 6.
[9] E. L. Lehmann (1986). Testing Statistical Hypotheses (second ed.). New York: Springer Verlag. ISBN 0-387-94919-4. page 65.
[10] Raikov, D. (1937). On the decomposition of Poisson laws. Comptes Rendus (Doklady) de l'Académie des Sciences de l'URSS, 14, 9–11. (The proof is also given in von Mises, Richard (1964). Mathematical Theory of Probability and Statistics. New York: Academic Press.)
[11] Laha, R. G. and Rohatgi, V. K. Probability Theory. New York: John Wiley & Sons. p. 233. ISBN 0-471-03262-X.
[12] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 159.
[13] Massimo Franceschetti, Olivier Dousse, David N. C. Tse and Patrick Thiran (2007). "Closing the Gap in the Capacity of Wireless Networks Via Percolation Theory" (http://circuit.ucsd.edu/~massimo/Journal/IEEE-TIT-Capacity.pdf). IEEE Transactions on Information Theory 53 (3): 1009–1018.
[14] NIST/SEMATECH, "6.3.3.1. Counts Control Charts" (http://www.itl.nist.gov/div898/handbook/pmc/section3/pmc331.htm), e-Handbook of Statistical Methods, accessed 25 October 2006.
[15] Huiming, Zhang; Lili Chu, Yu Diao (2012). "Some Properties of the Generalized Stuttering Poisson Distribution and its Applications". Studies in Mathematical Sciences 5 (1): 11–26. doi:10.3968/j.sms.1923845220120501.Z0697.
[16] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models. London: Chapman and Hall. ISBN 0-412-31760-5. Page 196 gives the approximation and higher order terms.
[17] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 163.
[18] S. M. Ross (2007). Introduction to Probability Models (ninth ed.). Boston: Academic Press. ISBN 978-0-12-598062-3. pp. 307–308.
[19] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 171.
[20] Johnson, N.L., Kotz, S., Kemp, A.W. (1993) Univariate Discrete Distributions (2nd edition). Wiley. ISBN 0-471-54897-9, p. 153.
[21] Philip J. Boland. "A Biographical Glimpse of William Sealy Gosset". The American Statistician, Vol. 38, No. 3 (Aug., 1984), pp. 179–183. Retrieved 2011-06-22. "At the turn of the 19th century, Arthur Guinness, Son & Co. became interested in hiring scientists to analyze data concerned with various aspects of its brewing process. Gosset was to be one of the first of these scientists, and so it was that in 1899 he moved to Dublin to take up a job as a brewer at St. James' Gate... Student published 22 papers, the first of which was entitled "On the Error of Counting With a Haemacytometer" (Biometrika, 1907). In it, Student illustrated the practical use of the Poisson distribution in counting the number of yeast cells on a square of a haemacytometer. Up until just before World War II, Guinness would not allow its employees to publish under their own names, and hence Gosset chose to write under the pseudonym of "Student.""
[22] Good, I. J. (1986). "Some statistical applications of Poisson's work". Statistical Science 1 (2): 157–180. doi:10.1214/ss/1177013690. JSTOR 2245435.
[23] Garwood, F. (1936). "Fiducial Limits for the Poisson Distribution". Biometrika 28 (3/4): 437–442. doi:10.1093/biomet/28.3-4.437.
[24] Breslow, NE; Day, NE (1987). Statistical Methods in Cancer Research: Volume 2, The Design and Analysis of Cohort Studies (http://www.iarc.fr/en/publications/pdfs-online/stat/sp82/index.php). Paris: International Agency for Research on Cancer. ISBN 978-92-832-0182-3.
[25] Loukas S, Kemp CD (1986) The index of dispersion test for the bivariate Poisson distribution. Biometrics 42(4) 941–948.


References
Joachim H. Ahrens, Ulrich Dieter (1974). "Computer Methods for Sampling from Gamma, Beta, Poisson and Binomial Distributions". Computing 12 (3): 223–246. doi:10.1007/BF02293108.
Joachim H. Ahrens, Ulrich Dieter (1982). "Computer Generation of Poisson Deviates". ACM Transactions on Mathematical Software 8 (2): 163–179. doi:10.1145/355993.355997.
Ronald J. Evans, J. Boersma, N. M. Blachman, A. A. Jagers (1988). "The Entropy of a Poisson Distribution: Problem 87-6". SIAM Review 30 (2): 314–317. doi:10.1137/1030059.
Donald E. Knuth (1969). Seminumerical Algorithms. The Art of Computer Programming, Volume 2. Addison Wesley.

Beta-binomial distribution


[Infobox plots: probability mass function and cumulative distribution function.]
Parameters: n ∈ N0 (number of trials); α > 0 (real); β > 0 (real)
Support: k ∈ {0, ..., n}
PMF and CDF: see text (the CDF involves 3F2(a, b, k), the generalized hypergeometric function)
Mean, variance, skewness, excess kurtosis, MGF, CF: see text

In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bernoulli trials is either unknown or random. It is frequently used in Bayesian statistics, empirical Bayes methods and classical statistics as an overdispersed binomial distribution.

It reduces to the Bernoulli distribution as a special case when n = 1. For α = β = 1, it is the discrete uniform distribution from 0 to n. It also approximates the binomial distribution arbitrarily well for large α and β. The beta-binomial is a one-dimensional version of the Dirichlet-multinomial distribution, as the binomial and beta distributions are special cases of the multinomial and Dirichlet distributions, respectively.


Motivation and derivation


Beta-binomial distribution as a compound distribution
The Beta distribution is a conjugate distribution of the binomial distribution. This fact leads to an analytically tractable compound distribution where one can think of the parameter p in the binomial distribution as being randomly drawn from a beta distribution. Namely, if

    X \sim \mathrm{Bin}(n, p), \qquad L(k \mid p) = \binom{n}{k}\, p^{k} (1-p)^{n-k},

is the binomial distribution where p is a random variable with a beta distribution

    \pi(p \mid \alpha, \beta) = \mathrm{Beta}(\alpha, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)} \qquad \text{for } 0 \le p \le 1,

then the compound distribution is given by

    f(k \mid n, \alpha, \beta) = \int_{0}^{1} L(k \mid p)\, \pi(p \mid \alpha, \beta)\, dp = \binom{n}{k}\, \frac{B(k+\alpha,\, n-k+\beta)}{B(\alpha, \beta)}.

Using the properties of the beta function, this can alternatively be written

    f(k \mid n, \alpha, \beta) = \frac{\Gamma(n+1)}{\Gamma(k+1)\,\Gamma(n-k+1)} \cdot \frac{\Gamma(k+\alpha)\,\Gamma(n-k+\beta)}{\Gamma(n+\alpha+\beta)} \cdot \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}.
It is within this context that the beta-binomial distribution appears often in Bayesian statistics: the beta-binomial is the predictive distribution of a binomial random variable with a beta distribution prior on the success probability.
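A small Python sketch of this compound pmf, computed on the log scale for numerical stability; the function names are illustrative only.

    from math import lgamma, exp, comb

    def betaln(a, b):
        """log Beta(a, b) via log-gamma, for numerical stability."""
        return lgamma(a) + lgamma(b) - lgamma(a + b)

    def beta_binomial_pmf(k, n, a, b):
        """P(K = k) for the beta-binomial: a binomial whose p is integrated over Beta(a, b)."""
        return comb(n, k) * exp(betaln(k + a, n - k + b) - betaln(a, b))

    # Sanity check: probabilities over the support sum to 1 (up to rounding)
    print(sum(beta_binomial_pmf(k, 10, 2.0, 3.0) for k in range(11)))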

Beta-binomial as an urn model


The beta-binomial distribution can also be motivated via an urn model for positive integer values of and . Specifically, imagine an urn containing red balls and black balls, where random draws are made. If a red ball is observed, then two red balls are returned to the urn. Likewise, if a black ball is drawn, it is replaced and another black ball is added to the urn. If this is repeated n times, then the probability of observing k red balls follows a beta-binomial distribution with parameters n, and . Note that if the random draws are with simple replacement (no balls over and above the observed ball are added to the urn), then the distribution follows a binomial distribution and if the random draws are made without replacement, the distribution follows a hypergeometric distribution.
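The urn scheme is easy to simulate. The following Python sketch (with made-up function names) draws from the urn n times and returns the number of red balls observed, which should empirically match the beta-binomial pmf for positive integer α and β.

    import random

    def polya_urn_draws(n, alpha, beta):
        """Count red balls in n draws, reinforcing each colour that is drawn."""
        red, black, k = alpha, beta, 0
        for _ in range(n):
            if random.random() < red / (red + black):
                k += 1
                red += 1        # observed red ball returned together with an extra red ball
            else:
                black += 1      # observed black ball returned together with an extra black ball
        return k

    # Empirical mean should approximate n*alpha/(alpha+beta) = 4 for these parameters
    samples = [polya_urn_draws(10, 2, 3) for _ in range(10000)]
    print(sum(samples) / len(samples))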


Moments and properties


The first three raw moments are

and the kurtosis is

Letting

    \pi = \frac{\alpha}{\alpha+\beta},

we note, suggestively, that the mean can be written as

    \mu = \frac{n\alpha}{\alpha+\beta} = n\pi

and the variance as

    \sigma^2 = \frac{n\alpha\beta\,(\alpha+\beta+n)}{(\alpha+\beta)^2\,(\alpha+\beta+1)} = n\pi(1-\pi)\bigl[1+(n-1)\rho\bigr],

where \rho = \tfrac{1}{\alpha+\beta+1} is the pairwise correlation between the n Bernoulli draws and is called the over-dispersion parameter.

Point estimates
Method of moments
The method of moments estimates can be gained by noting the first and second moments of the beta-binomial, namely

    \mu_1 = \frac{n\alpha}{\alpha+\beta}, \qquad \mu_2 = \frac{n\alpha\,[n(1+\alpha)+\beta]}{(\alpha+\beta)(1+\alpha+\beta)},

and setting these raw moments equal to the sample moments

    m_1 = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad m_2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2,

and solving for α and β we get

    \hat{\alpha} = \frac{n m_1 - m_2}{n\left(\frac{m_2}{m_1} - m_1 - 1\right) + m_1}, \qquad \hat{\beta} = \frac{(n - m_1)\left(n - \frac{m_2}{m_1}\right)}{n\left(\frac{m_2}{m_1} - m_1 - 1\right) + m_1}.

Note that these estimates can be nonsensically negative, which is evidence that the data are either undispersed or underdispersed relative to the binomial distribution. In this case, the binomial distribution and the hypergeometric distribution are alternative candidates, respectively.
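A minimal Python sketch of these method-of-moments estimates, assuming every observation is a count out of the same number of trials n; the function name is illustrative.

    def betabinom_method_of_moments(ks, n):
        """Method-of-moments estimates (alpha_hat, beta_hat) from counts ks out of n trials each."""
        N = len(ks)
        m1 = sum(ks) / N                      # first sample moment
        m2 = sum(k * k for k in ks) / N       # second (raw) sample moment
        denom = n * (m2 / m1 - m1 - 1) + m1
        alpha_hat = (n * m1 - m2) / denom
        beta_hat = (n - m1) * (n - m2 / m1) / denom
        return alpha_hat, beta_hat

Negative values returned by this routine signal that the counts are not overdispersed relative to a binomial, exactly as noted above.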


Maximum likelihood estimation


While closed-form maximum likelihood estimates are impractical, the pdf consists of common functions (gamma and/or beta functions), so the estimates can easily be found via direct numerical optimization. Maximum likelihood estimates from empirical data can be computed using general methods for fitting multinomial Pólya distributions, methods for which are described in (Minka 2003). The R package VGAM, through the function vglm, facilitates maximum likelihood fitting of GLM-type models with responses distributed according to the beta-binomial distribution. Note also that there is no requirement that n is fixed throughout the observations.
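For example, the following Python sketch maximizes the beta-binomial log-likelihood numerically with scipy; the starting values, bounds and function names are arbitrary choices, not part of any standard routine.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.special import gammaln

    def betabinom_negloglik(params, ks, n):
        """Negative log-likelihood of counts ks (out of n trials each) under BetaBin(n, a, b)."""
        a, b = params
        ks = np.asarray(ks, dtype=float)
        logpmf = (gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
                  + gammaln(ks + a) + gammaln(n - ks + b) - gammaln(n + a + b)
                  + gammaln(a + b) - gammaln(a) - gammaln(b))
        return -logpmf.sum()

    def betabinom_mle(ks, n, start=(1.0, 1.0)):
        """Direct numerical maximisation of the beta-binomial likelihood."""
        res = minimize(betabinom_negloglik, start, args=(ks, n),
                       method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
        return res.x   # (alpha_hat, beta_hat)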

Example
The following data gives the number of male children among the first 12 children of family size 13 in 6115 families taken from hospital records in 19th century Saxony (Sokal and Rohlf, p.59 from Lindsey). The 13th child is ignored to assuage the effect of families non-randomly stopping when a desired gender is reached.
Males 0 1 2 3 4 5 6 7 8 9 10 11 12

Families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7

We note the first two sample moments are

and therefore the method of moments estimates are

The maximum likelihood estimates can be found numerically

and the maximized log-likelihood is

from which we find the AIC

The AIC for the competing binomial model is AIC = 25070.34, and thus we see that the beta-binomial model provides a superior fit to the data, i.e. there is evidence for overdispersion. Trivers and Willard posit a theoretical justification for heterogeneity in gender-proneness among families (i.e. overdispersion). The superior fit is evident especially among the tails, as the following table shows.


Males 0 1 2 3 4 5 6 7 8 9 10 11 12
Observed families 3 24 104 286 670 1033 1343 1112 829 478 181 45 7
Predicted (Beta-Binomial) 2.3 22.6 104.8 310.9 655.7 1036.2 1257.9 1182.1 853.6 461.9 177.9 43.8 5.2
Predicted (Binomial p = 0.519215) 0.9 12.1 71.8 258.5 628.1 1085.2 1367.3 1265.6 854.2 410.0 132.8 26.1 2.3

Further Bayesian considerations


It is convenient to reparameterize the distributions so that the expected mean of the prior is a single parameter: Let

    \pi(\theta \mid \mu, M) = \mathrm{Beta}\bigl(M\mu,\; M(1-\mu)\bigr),

where

    \mu = \frac{\alpha}{\alpha+\beta}, \qquad M = \alpha+\beta,

so that

    \mathbb{E}[\theta \mid \mu, M] = \mu.

The posterior distribution \pi(\theta \mid k) is also a beta distribution:

    \pi(\theta \mid k) \propto \ell(k \mid \theta)\,\pi(\theta \mid \mu, M) = \mathrm{Beta}\bigl(M\mu + k,\; M(1-\mu) + n - k\bigr).

And

    \mathbb{E}[\theta \mid k] = \frac{M\mu + k}{M + n},

while the marginal distribution m(k \mid \mu, M) is given by

    m(k \mid \mu, M) = \int_{0}^{1} \ell(k \mid \theta)\,\pi(\theta \mid \mu, M)\, d\theta = \binom{n}{k}\, \frac{B\bigl(M\mu + k,\; M(1-\mu) + n - k\bigr)}{B\bigl(M\mu,\; M(1-\mu)\bigr)}.

Because the marginal m(k | μ, M) is a complex, non-linear function of Gamma and Digamma functions, it is quite difficult to obtain a marginal maximum likelihood estimate (MMLE) for the mean and variance. Instead, we use the method of iterated expectations to find the expected value of the marginal moments. Let us write our model as a two-stage compound sampling model. Let ki be the number of successes out of ni trials for event i:

    k_i \sim \mathrm{Bin}(n_i,\, \theta_i), \qquad \theta_i \sim \mathrm{Beta}\bigl(M\mu,\; M(1-\mu)\bigr), \ \text{i.i.d.}

We can find iterated moment estimates for the mean and variance using the moments for the distributions in the two-stage model:


(Here we have used the law of total expectation and the law of total variance.) We want point estimates for μ and M. The estimated mean \hat{\mu} is calculated from the sample

    \hat{\mu} = \frac{\sum_i k_i}{\sum_i n_i}.

The estimate of the hyperparameter M is obtained using the moment estimates for the variance of the two-stage model:

Solving:

where

Since we now have parameter point estimates \hat{\mu} and \hat{M} for the underlying distribution, we would like to find a point estimate \hat{\theta}_i for the probability of success for event i. This is the weighted average of the event estimate k_i/n_i and the prior mean estimate \hat{\mu}. Given our point estimates for the prior, we may now plug in these values to find a point estimate for the posterior:

    \hat{\theta}_i = \mathbb{E}[\theta_i \mid k_i] = \frac{\hat{M}\hat{\mu} + k_i}{\hat{M} + n_i}.

Shrinkage factors
We may write the posterior estimate as a weighted average:

    \hat{\theta}_i = \hat{B}_i\,\hat{\mu} + (1 - \hat{B}_i)\,\frac{k_i}{n_i},

where

    \hat{B}_i = \frac{\hat{M}}{\hat{M} + n_i}

is called the shrinkage factor.


Related distributions
When α = β = 1, the beta-binomial distribution is U(0, n), where U(0, n) is the discrete uniform distribution on {0, 1, ..., n}.

References
* Minka, Thomas P. (2003). Estimating a Dirichlet distribution [1]. Microsoft Technical Report.

External links
Using the Beta-binomial distribution to assess performance of a biometric identification device [2]
Fastfit [3] contains Matlab code for fitting Beta-Binomial distributions (in the form of two-dimensional Pólya distributions) to data.

References
[1] http://research.microsoft.com/~minka/papers/dirichlet/
[2] http://it.stlawu.edu/~msch/biometrics/papers.htm
[3] http://research.microsoft.com/~minka/software/fastfit/

Negative binomial distribution


Different texts adopt slightly different definitions for the negative binomial distribution. They can be distinguished by whether the support starts at k = 0 or at k = r, and whether p denotes the probability of a success or of a failure.

[Infobox plots: probability mass function. The orange line represents the mean, which is equal to 10 in each of these plots; the green line shows the standard deviation.]

Parameters: r > 0, number of failures until the experiment is stopped (integer, but the definition can also be extended to reals); p ∈ (0, 1), success probability in each experiment (real)
Support: k ∈ {0, 1, 2, 3, ...}, number of successes
PMF: involves a binomial coefficient; CDF: the regularized incomplete beta function (see text)
Mean, mode, variance, skewness, excess kurtosis, MGF, CF, PGF: see text

In probability theory and statistics, the negative binomial distribution is a discrete probability distribution of the number of successes in a sequence of Bernoulli trials before a specified (non-random) number of failures (denoted r) occurs. For example, if one throws a die repeatedly until the third time 1 appears, then the probability distribution of the number of non-1s that had appeared will be negative binomial. The Pascal distribution (after Blaise Pascal) and Pólya distribution (for George Pólya) are special cases of the negative binomial. There is a convention among engineers, climatologists, and others to reserve negative binomial in a strict sense or Pascal for the case of an integer-valued stopping-time parameter r, and use Pólya for the real-valued case. The Pólya distribution more accurately models occurrences of contagious discrete events, like tornado outbreaks, than the Poisson distribution by allowing the mean and variance to be different, unlike the Poisson. Contagious events have positively correlated occurrences causing a larger variance than if the occurrences were independent, due to a positive covariance term.


Definition
Suppose there is a sequence of independent Bernoulli trials, each trial having two potential outcomes called success and failure. In each trial the probability of success is p and of failure is (1 − p). We are observing this sequence until a predefined number r of failures has occurred. Then the random number of successes we have seen, X, will have the negative binomial (or Pascal) distribution:

    X \sim \mathrm{NB}(r,\, p).
When applied to real-world problems, outcomes of success and failure may or may not be outcomes we ordinarily view as good and bad, respectively. Suppose we used the negative binomial distribution to model the number of days a certain machine works before it breaks down. In this case success would be the result on a day when the machine worked properly, whereas a breakdown would be a failure. If we used the negative binomial distribution to model the number of goal attempts a sportsman makes before scoring a goal, though, then each unsuccessful attempt would be a success, and scoring a goal would be failure. If we are tossing a coin, then the negative binomial distribution can give the number of heads (success) we are likely to encounter before we encounter a certain number of tails (failure). The probability mass function of the negative binomial distribution is

Here the quantity in parentheses is the binomial coefficient, and is equal to

    \binom{k+r-1}{k} = \frac{(k+r-1)!}{k!\,(r-1)!} = \frac{(k+r-1)(k+r-2)\cdots r}{k!}.

This quantity can alternatively be written in the following manner, explaining the name negative binomial:

    \frac{(k+r-1)(k+r-2)\cdots r}{k!} = (-1)^{k}\binom{-r}{k}.
To understand the above definition of the probability mass function, note that the probability for every specific sequence of ksuccesses and rfailures is (1 p)rpk, because the outcomes of the k+r trials are supposed to happen independently. Since the rthfailure comes last, it remains to choose the ktrials with successes out of the remaining k+r1 trials. The above binomial coefficient, due to its combinatorial interpretation, gives precisely the number of all these sequences of length k+r1.
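A tiny Python check of this definition, using the die example from the introduction (success = not rolling a 1, so p = 5/6, stopping at the r = 3rd 1); the function name is made up.

    from math import comb

    def negbin_pmf(k, r, p):
        """P(X = k): probability of k successes (success prob p) before the r-th failure."""
        return comb(k + r - 1, k) * (p ** k) * ((1 - p) ** r)

    # Die example: success = "not a 1" (p = 5/6), stop at the r = 3rd "1"
    probs = [negbin_pmf(k, r=3, p=5/6) for k in range(1000)]
    print(sum(probs))   # ~1.0, so the pmf is properly normalised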

Extension to real-valued r
It is possible to extend the definition of the negative binomial distribution to the case of a positive real parameter r. Although it is impossible to visualize a non-integer number of failures, we can still formally define the distribution through its probability mass function. As before, we say that X has a negative binomial (or Pólya) distribution if it has a probability mass function:

    f(k;\, r,\, p) = \binom{k+r-1}{k}\,(1-p)^{r}\,p^{k} \qquad \text{for } k = 0, 1, 2, \ldots

Here r is a real, positive number. The binomial coefficient is then defined by the multiplicative formula and can also be rewritten using the gamma function:

    \binom{k+r-1}{k} = \frac{(k+r-1)(k+r-2)\cdots r}{k!} = \frac{\Gamma(k+r)}{k!\,\Gamma(r)}. \qquad (*)


Note that by the binomial series and (*) above, for every 0 ≤ p < 1,

    (1-p)^{-r} = \sum_{k=0}^{\infty} \binom{k+r-1}{k}\, p^{k},

hence the terms of the probability mass function indeed add up to one.

Alternative formulations
Some textbooks may define the negative binomial distribution slightly differently than it is done here. The most common variations are: The definition where X is the total number of trials needed to get r failures, not simply the number of successes. Since the total number of trials is equal to the number of successes plus the number of failures, this definition differs from ours by adding constantr. In order to convert formulas written with this definition into the one used in the article, replace everywhere k with k - r, and also subtract r from the mean, the median, and the mode. In order to convert formulas of this article into this alternative definition, replace k with k + r and add r to the mean, the median and the mode. Effectively, this implies using the probability mass function

which perhaps resembles the binomial distribution more closely than the version above. Note that the arguments of the binomial coefficient are decremented due to order: the last "failure" must occur last, and so the other events have one fewer positions available when counting possible orderings. Note that this definition of the negative binomial distribution does not easily generalize to a positive, real parameterr. The definition where p denotes the probability of a failure, not of a success. In order to convert formulas between this definition and the one used in the article, replace p with 1 p everywhere. The definition where the support X is defined as the number of failures, rather than the number of successes. This definition where X counts failures but p is the probability of success has exactly the same formulas as in the previous case where X counts successes but p is the probability of failure. However, the corresponding text will have the words failure and success swapped compared with the previous case. The two alterations above may be applied simultaneously, i.e. X counts total trials, and p is the probability of failure.

Occurrence
Waiting time in a Bernoulli process
For the special case where r is an integer, the negative binomial distribution is known as the Pascal distribution. It is the probability distribution of a certain number of failures and successes in a series of independent and identically distributed Bernoulli trials. For k+r Bernoulli trials with success probability p, the negative binomial gives the probability of k successes and r failures, with a failure on the last trial. In other words, the negative binomial distribution is the probability distribution of the number of successes before the rth failure in a Bernoulli process, with probability p of successes on each trial. A Bernoulli process is a discrete time process, and so the number of trials, failures, and successes are integers. Consider the following example. Suppose we repeatedly throw a die, and consider a 1 to be a failure. The probability of failure on each trial is 1/6. The number of successes before the third failure belongs to the infinite set { 0,1,2,3,... }. That number of successes is a negative-binomially distributed random variable.

When r = 1 we get the probability distribution of the number of successes before the first failure (i.e. the probability of the first failure occurring on the (k+1)st trial), which is a geometric distribution:


Overdispersed Poisson
The negative binomial distribution, especially in its alternative parameterization described above, can be used as an alternative to the Poisson distribution. It is especially useful for discrete data over an unbounded positive range whose sample variance exceeds the sample mean. In such cases, the observations are overdispersed with respect to a Poisson distribution, for which the mean is equal to the variance. Hence a Poisson distribution is not an appropriate model. Since the negative binomial distribution has one more parameter than the Poisson, the second parameter can be used to adjust the variance independently of the mean. See Cumulants of some discrete probability distributions. An application of this is to annual counts of tropical cyclones in the North Atlantic or to monthly to 6-monthly counts of wintertime extratropical cyclones over Europe, for which the variance is greater than the mean.[1][2][3] In the case of modest overdispersion, this may produce substantially similar results to an overdispersed Poisson distribution.[4][5]

Related distributions
The geometric distribution (on {0, 1, 2, 3, ...}) is a special case of the negative binomial distribution, with r = 1.

The negative binomial distribution is a special case of the discrete phase-type distribution. The negative binomial distribution is a special case of the stuttering Poisson distribution.[6]

Poisson distribution
Consider a sequence of negative binomial distributions where the stopping parameter r goes to infinity, whereas the probability of success in each trial, p, goes to zero in such a way as to keep the mean of the distribution constant. Denoting this mean λ, the parameter p will have to be

    p = \frac{\lambda}{r + \lambda}.

Under this parametrization the probability mass function will be

Now if we consider the limit as r → ∞, the second factor will converge to one, and the third to the exponent function:

which is the mass function of a Poisson-distributed random variable with expected value λ. In other words, the alternatively parameterized negative binomial distribution converges to the Poisson distribution and r controls the deviation from the Poisson. This makes the negative binomial distribution suitable as a robust alternative to the Poisson, which approaches the Poisson for large r, but which has larger variance than the Poisson for small r.


Gamma–Poisson mixture
The negative binomial distribution also arises as a continuous mixture of Poisson distributions (i.e. a compound probability distribution) where the mixing distribution of the Poisson rate is a gamma distribution. That is, we can view the negative binomial as a Poisson() distribution, where is itself a random variable, distributed according to Gamma(r, p/(1 p)). Formally, this means that the mass function of the negative binomial distribution can be written as

Because of this, the negative binomial distribution is also known as the gammaPoisson (mixture) distribution.
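This mixture representation is easy to verify by simulation. The following NumPy sketch draws the Poisson rate from a gamma distribution with shape r and scale p/(1 − p) and compares the resulting counts with direct negative binomial draws; note that numpy's negative_binomial(n, p) counts failures before the n-th success, so passing (r, 1 − p) matches this article's convention. The seed and parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    r, p = 3.0, 0.4

    # Draw the Poisson rate from a gamma distribution, then the count from Poisson(rate)
    lam = rng.gamma(shape=r, scale=p / (1 - p), size=100000)
    mixed = rng.poisson(lam)

    # Compare with a direct negative binomial draw under the same convention
    direct = rng.negative_binomial(r, 1 - p, size=100000)
    print(mixed.mean(), direct.mean(), r * p / (1 - p))   # all close to 2.0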

Sum of geometric distributions


If Yr is a random variable following the negative binomial distribution with parameters r and p, and support {0, 1, 2, ...}, then Yr is a sum of r independent variables following the geometric distribution (on {0, 1, 2, 3, ...}) with parameter 1 − p. As a result of the central limit theorem, Yr (properly scaled and shifted) is therefore approximately normal for sufficiently large r. Furthermore, if Bs+r is a random variable following the binomial distribution with parameters s + r and 1 − p, then

In this sense, the negative binomial distribution is the "inverse" of the binomial distribution. The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for parameter p is negative-binomially distributed with the same p but with "r-value"r1+r2. The negative binomial distribution is infinitely divisible, i.e., if Y has a negative binomial distribution, then for any positive integer n, there exist independent identically distributed random variables Y1,...,Yn whose sum has the same distribution that Y has.


Representation as compound Poisson distribution


The negative binomial distribution NB(r,p) can be represented as a compound Poisson distribution: Let {Yn, n 0} denote a sequence of independent and identically distributed random variables, each one having the logarithmic distribution Log(p), with probability mass function

Let N be a random variable, independent of the sequence, and suppose that N has a Poisson distribution with mean λ = −r ln(1 − p). Then the random sum

    X = \sum_{n=1}^{N} Y_n

is NB(r,p)-distributed. To prove this, we calculate the probability generating function GX of X, which is the composition of the probability generating functions GN and GY1. Using and

we obtain

which is the probability generating function of the NB(r,p) distribution.

Properties
Cumulative distribution function
The cumulative distribution function can be expressed in terms of the regularized incomplete beta function:

    F(k;\, r,\, p) \equiv \Pr(X \le k) = I_{1-p}(r,\, k+1).
Sampling and point estimation of p


Suppose p is unknown and an experiment is conducted where it is decided ahead of time that sampling will continue until r successes are found. A sufficient statistic for the experiment is k, the number of failures. In estimating p, the minimum variance unbiased estimator is

    \hat{p} = \frac{r-1}{r+k-1}.

The maximum likelihood estimate of p is

    \tilde{p} = \frac{r}{r+k},

but this is a biased estimate. Its inverse, (r + k)/r, is an unbiased estimate of 1/p, however.[7]


Relation to the binomial theorem


Suppose Y is a random variable with a binomial distribution with parameters n and p. Assume p + q = 1, with p, q >=0. Then the binomial theorem implies that

Using Newton's binomial theorem, this can equally be written as:

in which the upper bound of summation is infinite. In this case, the binomial coefficient

is defined when n is a real number, instead of just a positive integer. But in our case of the binomial distribution it is zero when k > n. We can then say, for example

Now suppose r > 0 and we use a negative exponent:

Then all of the terms are positive, and the term

is just the probability that the number of failures before the rth success is equal to k, provided r is an integer. (If r is a negative non-integer, so that the exponent is a positive non-integer, then some of the terms in the sum above are negative, so we do not have a probability distribution on the set of all nonnegative integers.) Now we also allow non-integer values of r. Then we have a proper negative binomial distribution, which is a generalization of the Pascal distribution, which coincides with the Pascal distribution when r happens to be a positive integer. Recall from above that The sum of independent negative-binomially distributed random variables r1 and r2 with the same value for parameter p is negative-binomially distributed with the same p but with "r-value"r1+r2. This property persists when the definition is thus generalized, and affords a quick way to see that the negative binomial distribution is infinitely divisible.


Parameter estimation
Maximum likelihood estimation
The likelihood function for N iid observations (k1,...,kN) is

from which we calculate the log-likelihood function

To find the maximum we take the partial derivatives with respect to r and p and set them equal to zero:

    \frac{\partial \ell}{\partial p} = \sum_{i=1}^{N} \frac{k_i}{p} - \frac{N r}{1-p} = 0 \qquad \text{and} \qquad \frac{\partial \ell}{\partial r} = \sum_{i=1}^{N} \psi(k_i + r) - N\,\psi(r) + N \ln(1-p) = 0,

where ψ(·) is the digamma function. Solving the first equation for p gives:

    p = \frac{\sum_{i=1}^{N} k_i}{N r + \sum_{i=1}^{N} k_i}.

Substituting this in the second equation gives:

    \sum_{i=1}^{N} \psi(k_i + r) - N\,\psi(r) + N \ln\!\left(\frac{r}{r + \sum_{i=1}^{N} k_i / N}\right) = 0.
This equation cannot be solved in closed form. If a numerical solution is desired, an iterative technique such as Newton's method can be used.
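In practice one often maximizes the profile likelihood in r numerically instead of solving the score equation directly. The following Python sketch does this with scipy; it assumes the data are overdispersed (so a finite maximizer in r exists), and the bounds and function names are arbitrary.

    import numpy as np
    from scipy.special import gammaln
    from scipy.optimize import minimize_scalar

    def negbin_mle(ks, r_max=1e4):
        """MLE of (r, p) in the 'successes before the r-th failure' parameterisation.

        Profiles the likelihood over r, using p_hat(r) = kbar / (r + kbar); a sketch
        that assumes an overdispersed sample with a nonzero mean."""
        ks = np.asarray(ks, dtype=float)
        n, kbar = len(ks), ks.mean()

        def neg_profile_loglik(r):
            p = kbar / (r + kbar)
            return -(gammaln(ks + r).sum() - n * gammaln(r) - gammaln(ks + 1).sum()
                     + ks.sum() * np.log(p) + n * r * np.log(1 - p))

        r_hat = minimize_scalar(neg_profile_loglik, bounds=(1e-6, r_max), method="bounded").x
        return r_hat, kbar / (r_hat + kbar)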

Examples
Selling candy
Pat is required to sell candy bars to raise money for the 6th grade field trip. There are thirty houses in the neighborhood, and Pat is not supposed to return home until five candy bars have been sold. So the child goes door to door, selling candy bars. At each house, there is a 0.4 probability of selling one candy bar and a 0.6 probability of selling nothing. What's the probability of selling the last candy bar at the nth house? Recall that the NegBin(r, p) distribution describes the probability of k failures and r successes in k+r Bernoulli(p) trials with success on the last trial. Selling five candy bars means getting five successes. The number of trials (i.e. houses) this takes is therefore k+5=n. The random variable we are interested in is the number of houses, so we substitute k=n5 into a NegBin(5,0.4) mass function and obtain the following mass function of the distribution of houses (for n5):

What's the probability that Pat finishes on the tenth house?


What's the probability that Pat finishes on or before reaching the eighth house? To finish on or before the eighth house, Pat must finish at the fifth, sixth, seventh, or eighth house. Sum those probabilities:

What's the probability that Pat exhausts all 30 houses in the neighborhood? This can be expressed as the probability that Pat does not finish on the fifth through the thirtieth house:
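These three probabilities can be computed directly with scipy.stats.nbinom, whose parameterization (failures before the n-th success, with success probability p) matches the convention used in this example; the values in the comments are approximate.

    from scipy.stats import nbinom

    houses = nbinom(5, 0.4)        # failures (no-sale houses) before the 5th sale

    # Finishes exactly at the 10th house: 5 failures plus 5 successes
    print(houses.pmf(10 - 5))      # ~0.1003

    # Finishes on or before the 8th house: at most 3 failures
    print(houses.cdf(8 - 5))       # ~0.1737

    # Exhausts all 30 houses without finishing: more than 25 failures
    print(houses.sf(30 - 5))       # survival function, 1 - cdf(25)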

Polygyny in African societies


Data on polygyny among a wide range of traditional African societies suggest that the distribution of wives follow a range of binomial profiles. The majority of these are negative binomial indicating the degree of competition for wives. However some tend towards a Poisson Distribution and even beyond towards a true binomial, indicating a degree of conformity in the allocation of wives. Further analysis of these profiles indicates shifts along this continuum between more competitiveness or more conformity according to the age of the husband and also according to the status of particular sectors within a society. In this way, these binomial distributions provide a tool for comparison, between societies, between sectors of societies, and over time.[8]

References
[1] Villarini, G.; Vecchi, G.A. and Smith, J.A. (2010). "Modeling of the dependence of tropical storm counts in the North Atlantic Basin on climate indices". Monthly Weather Review 138 (7): 2681–2705. doi:10.1175/2010MWR3315.1.
[2] Mailier, P.J.; Stephenson, D.B.; Ferro, C.A.T.; Hodges, K.I. (2006). "Serial Clustering of Extratropical Cyclones". Monthly Weather Review 134 (8): 2224–2240. doi:10.1175/MWR3160.1.
[3] Vitolo, R.; Stephenson, D.B.; Cook, Ian M.; Mitchell-Wallace, K. (2009). "Serial clustering of intense European storms". Meteorologische Zeitschrift 18 (4): 411–424. doi:10.1127/0941-2948/2009/0393.
[4] McCullagh, Peter; Nelder, John (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC. ISBN 0-412-31760-5.
[5] Cameron, Adrian C.; Trivedi, Pravin K. (1998). Regression analysis of count data. Cambridge University Press. ISBN 0-521-63567-5.
[6] Huiming, Zhang; Lili Chu, Yu Diao (2012). "Some Properties of the Generalized Stuttering Poisson Distribution and its Applications". Studies in Mathematical Sciences 5 (1): 11–26. doi:10.3968/j.sms.1923845220120501.Z0697.
[7] J. B. S. Haldane, "On a Method of Estimating Frequencies", Biometrika, Vol. 33, No. 3 (Nov., 1945), pp. 222–225. JSTOR 2332299.
[8] Spencer, Paul, 1998, The Pastoral Continuum: the Marginalization of Tradition in East Africa, Clarendon Press, Oxford (pp. 51–92).

Further reading
Hilbe, Joseph M. (2007). Negative Binomial Regression. Cambridge, UK: Cambridge University Press. (http://www.cambridge.org/uk/catalogue/catalogue.asp?isbn=9780521857727)


Geometric distribution
In probability theory and statistics, the geometric distribution is either of two discrete probability distributions: the probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set {1, 2, 3, ...}, or the probability distribution of the number Y = X − 1 of failures before the first success, supported on the set {0, 1, 2, 3, ...}. Which of these one calls "the" geometric distribution is a matter of convention and convenience.
[Infobox plots for both variants: probability mass function and cumulative distribution function.]
Parameters: 0 < p ≤ 1, success probability (real)
Support: k ∈ {1, 2, 3, ...} for the number of trials X; k ∈ {0, 1, 2, ...} for the number of failures Y
PMF, CDF, mean, median (not necessarily unique), mode, variance, skewness, excess kurtosis, entropy, MGF, CF: see text



These two different geometric distributions should not be confused with each other. Often, the name shifted geometric distribution is adopted for the former (the distribution of the number X); however, to avoid ambiguity, it is considered wise to indicate which is intended by mentioning the support explicitly. If the probability of success on each trial is p, then the probability that the kth trial is the first success is

    \Pr(X = k) = (1-p)^{k-1}\, p

for k = 1, 2, 3, .... The above form of the geometric distribution is used for modeling the number of trials until the first success. By contrast, the following form of the geometric distribution is used for modeling the number of failures until the first success:

    \Pr(Y = k) = (1-p)^{k}\, p

for k = 0, 1, 2, 3, .... In either case, the sequence of probabilities is a geometric sequence. For example, suppose an ordinary die is thrown repeatedly until the first time a "1" appears. The probability distribution of the number of times it is thrown is supported on the infinite set {1, 2, 3, ...} and is a geometric distribution with p = 1/6.

Moments and cumulants


The expected value of a geometrically distributed random variable X is 1/p and the variance is (1 − p)/p²:

    \mathbb{E}(X) = \frac{1}{p}, \qquad \operatorname{var}(X) = \frac{1-p}{p^{2}}.

Similarly, the expected value of the geometrically distributed random variable Y is (1 − p)/p, and its variance is (1 − p)/p²:

    \mathbb{E}(Y) = \frac{1-p}{p}, \qquad \operatorname{var}(Y) = \frac{1-p}{p^{2}}.

Let μ = (1 − p)/p be the expected value of Y. Then the cumulants κ_n of the probability distribution of Y satisfy the recursion

Outline of proof: That the expected value is (1 − p)/p can be shown in the following way. Let Y be as above. Then


(The interchange of summation and differentiation is justified by the fact that convergent power series converge uniformly on compact subsets of the set of points where they converge.)

Parameter estimation
For both variants of the geometric distribution, the parameter p can be estimated by equating the expected value with the sample mean. This is the method of moments, which in this case happens to yield maximum likelihood estimates of p. Specifically, for the first variant let k = k1, ..., kn be a sample where ki ≥ 1 for i = 1, ..., n. Then p can be estimated as

    \hat{p} = \left(\frac{1}{n}\sum_{i=1}^{n} k_i\right)^{-1} = \frac{n}{\sum_{i=1}^{n} k_i}.

In Bayesian inference, the Beta distribution is the conjugate prior distribution for the parameter p. If this parameter is given a Beta(α, β) prior, then the posterior distribution is

    p \sim \mathrm{Beta}\!\left(\alpha + n,\; \beta + \sum_{i=1}^{n} (k_i - 1)\right).

The posterior mean E[p] approaches the maximum likelihood estimate \hat{p} as α and β approach zero.

In the alternative case, let k1, ..., kn be a sample where ki ≥ 0 for i = 1, ..., n. Then p can be estimated as

    \hat{p} = \left(1 + \frac{1}{n}\sum_{i=1}^{n} k_i\right)^{-1} = \frac{n}{n + \sum_{i=1}^{n} k_i}.

The posterior distribution of p given a Beta(α, β) prior is

    p \sim \mathrm{Beta}\!\left(\alpha + n,\; \beta + \sum_{i=1}^{n} k_i\right).

Again the posterior mean E[p] approaches the maximum likelihood estimate \hat{p} as α and β approach zero.

The probability-generating functions of X and Y are, respectively,

    G_X(s) = \frac{p\, s}{1 - (1-p)\, s}, \qquad G_Y(s) = \frac{p}{1 - (1-p)\, s}, \qquad |s| < (1-p)^{-1}.

Like its continuous analogue (the exponential distribution), the geometric distribution is memoryless. That means that if you intend to repeat an experiment until the first success, then, given that the first success has not yet occurred, the conditional probability distribution of the number of additional trials does not depend on how many failures have been observed. The die one throws or the coin one tosses does not have a "memory" of these

failures. The geometric distribution is in fact the only memoryless discrete distribution. Among all discrete probability distributions supported on {1, 2, 3, ...} with given expected value μ, the geometric distribution X with parameter p = 1/μ is the one with the largest entropy. The geometric distribution of the number Y of failures before the first success is infinitely divisible, i.e., for any positive integer n, there exist independent identically distributed random variables Y1, ..., Yn whose sum has the same distribution that Y has. These will not be geometrically distributed unless n = 1; they follow a negative binomial distribution. The decimal digits of the geometrically distributed random variable Y are a sequence of independent (and not identically distributed) random variables. For example, the hundreds digit D has this probability distribution:


where q = 1 − p, and similarly for the other digits, and, more generally, similarly for numeral systems with other bases than 10. When the base is 2, this shows that a geometrically distributed random variable can be written as a sum of independent random variables whose probability distributions are indecomposable. Golomb coding is the optimal prefix code for the geometric discrete distribution.

Related distributions
The geometric distribution Y is a special case of the negative binomial distribution, with r = 1. More generally, if Y1, ..., Yr are independent geometrically distributed variables with parameter p, then the sum

    Z = \sum_{m=1}^{r} Y_m

follows a negative binomial distribution with parameters r and 1 − p.[1]

If Y1, ..., Yr are independent geometrically distributed variables (with possibly different success parameters pm), then their minimum

    W = \min_{m} Y_m

is also geometrically distributed, with parameter p = 1 - \prod_{m}(1 - p_m).

Suppose 0 < r < 1, and for k = 1, 2, 3, ... the random variable Xk has a Poisson distribution with expected value r^k / k. Then

    \sum_{k=1}^{\infty} k\, X_k

has a geometric distribution taking values in the set {0, 1, 2, ...}, with expected value r/(1 − r).

The exponential distribution is the continuous analogue of the geometric distribution. If X is an exponentially distributed random variable with parameter λ, then

    Y = \lfloor X \rfloor,

where \lfloor \cdot \rfloor is the floor (or greatest integer) function, is a geometrically distributed random variable with parameter p = 1 − e^{−λ} (thus λ = −ln(1 − p)[2]) and taking values in the set {0, 1, 2, ...}. This can be used to generate geometrically distributed pseudorandom numbers by first generating exponentially distributed pseudorandom numbers from a uniform pseudorandom number generator: then \lfloor \ln(U) / \ln(1-p) \rfloor is geometrically distributed with parameter p, if U is uniformly distributed in [0, 1].
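A minimal Python sketch of this generator, using a single uniform draw per sample; the function names are illustrative, and the empirical mean is printed as a quick sanity check against (1 − p)/p.

    import math
    import random

    def geometric_failures(p):
        """Number of failures before the first success (support {0,1,2,...}), via inverse transform."""
        u = random.random()
        return math.floor(math.log(u) / math.log(1.0 - p))

    def geometric_trials(p):
        """Number of trials needed for the first success (support {1,2,3,...})."""
        return geometric_failures(p) + 1

    samples = [geometric_failures(0.3) for _ in range(100000)]
    print(sum(samples) / len(samples))    # close to (1-p)/p = 2.333...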


References
[1] Pitman, Jim. Probability (1993 edition). Springer Publishers. p. 372.
[2] http://www.wolframalpha.com/input/?i=inverse+p+%3D+1+-+e^-l

External links
Geometric distribution (http://planetmath.org/?op=getobj&from=objects&id=3456), PlanetMath.org.
Geometric distribution (http://mathworld.wolfram.com/GeometricDistribution.html) on MathWorld.
Online geometric distribution calculator (http://www.solvemymath.com/online_math_calculator/statistics/discrete_distributions/geometric/index.php)


Multinomial distribution
Parameters: n > 0, number of trials (integer); p1, ..., pk, event probabilities (p1 + ... + pk = 1)
Support, PMF, mean, variance, MGF, CF, PGF: see text

In probability theory, the multinomial distribution is a generalization of the binomial distribution. The binomial distribution is the probability distribution of the number of "successes" in n independent Bernoulli trials, with the same probability of "success" on each trial. In a multinomial distribution, the analog of the Bernoulli distribution is the categorical distribution, where each trial results in exactly one of some fixed finite number k of possible outcomes, with probabilities p1, ..., pk (so that pi ≥ 0 for i = 1, ..., k and p1 + ... + pk = 1), and there are n independent trials. Then let the random variables Xi indicate the number of times outcome number i was observed over the n trials. The vector X = (X1, ..., Xk) follows a multinomial distribution with parameters n and p, where p = (p1, ..., pk). Note that, in some fields, such as natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a categorical distribution is actually meant. This stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-K" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to k; in this form, a categorical distribution is equivalent to a multinomial distribution over a single observation.

Multinomial distribution

52

Specification
Probability mass function
Suppose one does an experiment of extracting n balls of k different categories from a bag, replacing the extracted ball after each draw. Balls from the same category are equivalent. If we denote by Xi the number of balls of category i that are extracted and by pi the probability of category i, the probability mass function of this multinomial distribution is:

    f(x_1, \ldots, x_k;\, n,\, p_1, \ldots, p_k) = \Pr(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\; p_1^{x_1} \cdots p_k^{x_k} \qquad \text{when } \sum_{i=1}^{k} x_i = n \ \text{(and 0 otherwise)},

for non-negative integers x1, ..., xk.

Visualization
As slices of generalized Pascal's triangle
Just like one can interpret the binomial distribution as (normalized) 1D slices of Pascal's triangle, so too can one interpret the multinomial distribution as 2D (triangular) slices of Pascal's pyramid, or 3D/4D/+ (pyramid-shaped) slices of higher-dimensional analogs of Pascal's triangle. This reveals an interpretation of the range of the distribution: discretized equilateral "pyramids" in arbitrary dimension, i.e. a simplex with a grid.

As polynomial coefficients
Similarly, just like one can interpret the binomial distribution as the polynomial coefficients of (p + q)^n when expanded, one can interpret the multinomial distribution as the coefficients of (p1 + p2 + ... + pk)^n when expanded. (Note that just like the binomial distribution, the coefficients must sum to 1.) This is the origin of the name "multinomial distribution".

Properties
The expected number of times the outcome i was observed over n trials is

    \mathbb{E}(X_i) = n p_i.

The covariance matrix is as follows. Each diagonal entry is the variance of a binomially distributed random variable, and is therefore

    \operatorname{var}(X_i) = n p_i (1 - p_i).

The off-diagonal entries are the covariances:

    \operatorname{cov}(X_i, X_j) = -n p_i p_j
for i, j distinct. All covariances are negative because for fixed n, an increase in one component of a multinomial vector requires a decrease in another component. This is a k k positive-semidefinite matrix of rank k1. In the special case where k=n and where the pi are all equal, the covariance matrix is the centering matrix. The entries of the corresponding correlation matrix are

Multinomial distribution

53

Note that the sample size drops out of this expression. Each of the k components separately has a binomial distribution with parameters n and pi, for the appropriate value of the subscript i. The support of the multinomial distribution is the set

    \{(x_1, \ldots, x_k) \in \mathbb{N}^{k} : x_1 + \cdots + x_k = n\}.

Its number of elements is

    \binom{n + k - 1}{k - 1},
the number of n-combinations of a multiset with k types, or multiset coefficient.

Example
In a recent three-way election for a large country, candidate A received 20% of the votes, candidate B received 30% of the votes, and candidate C received 50% of the votes. If six voters are selected randomly, what is the probability that there will be exactly one supporter for candidate A, two supporters for candidate B and three supporters for candidate C in the sample? Note: Since were assuming that the voting population is large, it is reasonable and permissible to think of the probabilities as unchanging once a voter is selected for the sample. Technically speaking this is sampling without replacement, so the correct distribution is the multivariate hypergeometric distribution, but the distributions converge as the population grows large.

Sampling from a multinomial distribution


First, reorder the parameters such that they are sorted in descending order (this is only to speed up computation and not strictly necessary). Now, for each trial, draw an auxiliary variable X from a uniform (0, 1) distribution. The resulting outcome is the component

    j = \min\Bigl\{ j' \in \{1, \ldots, k\} : \sum_{i=1}^{j'} p_i \ge X \Bigr\}.
This is a sample for the multinomial distribution with n=1. A sum of independent repetitions of this experiment is a sample from a multinomial distribution with n equal to the number of such repetitions.
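The procedure just described translates directly into code. The following Python sketch samples one categorical outcome by scanning the cumulative probabilities against a uniform draw, and repeats it n times to obtain a multinomial sample; sorting the probabilities in descending order first is an optional speed-up and is omitted here. Function names are made up.

    import random

    def categorical_sample(ps):
        """Return the index of the sampled category, given probabilities ps summing to 1."""
        u = random.random()
        cumulative = 0.0
        for i, p in enumerate(ps):
            cumulative += p
            if u < cumulative:
                return i
        return len(ps) - 1          # guard against floating-point round-off

    def multinomial_sample(n, ps):
        """Counts over categories from n independent categorical draws."""
        counts = [0] * len(ps)
        for _ in range(n):
            counts[categorical_sample(ps)] += 1
        return counts

    print(multinomial_sample(6, [0.2, 0.3, 0.5]))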

To simulate a multinomial distribution


Various methods may be used to simulate a multinomial distribution. A very simple one is to use a random number generator to generate numbers between 0 and 1. First, we divide the interval from 0 to 1 into k subintervals equal in size to the probabilities of the k categories. Then, we generate a random number for each of n trials and use a logical test to classify the virtual measure or observation into one of the categories. Example: suppose we have the following probabilities:


Categories: 1 2 3 4 5 6
Probabilities: 0.15 0.20 0.30 0.16 0.12 0.07
Superior limits of subintervals: 0.15 0.35 0.65 0.81 0.93 1.00

Then, with software like Excel, we may use the following recipe:

Cells:    Ai       Bi                 Ci                                 ...   Gi
Formulae: Alea()   =If($Ai<0.15;1;0)  =If(And($Ai>=0.15;$Ai<0.35);1;0)   ...   =If($Ai>=0.93;1;0)

After that, we will use functions such as SumIf to accumulate the observed results by category and to calculate the estimated covariance matrix for each simulated sample. Another way with Excel is to use its discrete random number generator; in that case, the categories must be labeled or relabeled with numeric values. In both cases, the result is a multinomial distribution with k categories without any correlation. This is equivalent, with a continuous random distribution, to simulating k independent standardized normal distributions, or a multinormal distribution N(0, I) having k components identically distributed and statistically independent.
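Outside of Excel, the same simulation is a one-liner with NumPy, which also makes it easy to compare the empirical covariance of the simulated samples with the theoretical covariance matrix given earlier; the probabilities below are the ones from the example table, and the sample sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    ps = np.array([0.15, 0.20, 0.30, 0.16, 0.12, 0.07])    # the six categories above
    n = 50

    samples = rng.multinomial(n, ps, size=20000)            # each row is one simulated sample
    empirical_cov = np.cov(samples, rowvar=False)

    # Theoretical covariance: n*p_i*(1-p_i) on the diagonal, -n*p_i*p_j off the diagonal
    theoretical_cov = n * (np.diag(ps) - np.outer(ps, ps))
    print(np.round(empirical_cov - theoretical_cov, 2))     # entries near zero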

Related distributions
When k = 2, the multinomial distribution is the binomial distribution.
The continuous analogue is the multivariate normal distribution.
Categorical distribution, the distribution of each trial; for k = 2, this is the Bernoulli distribution.
The Dirichlet distribution is the conjugate prior of the multinomial in Bayesian statistics.
Dirichlet-multinomial distribution.
Beta-binomial model.

References
Evans, Merran; Hastings, Nicholas; Peacock, Brian (2000). Statistical Distributions. New York: Wiley. pp.134136. ISBN0-471-37124-6. 3rd ed..


Categorical distribution
Parameters: k > 0, number of categories (integer); p1, ..., pk, event probabilities (p1 + ... + pk = 1)
Support: {1, ..., k}
PMF, CDF, mean, median, mode, variance, MGF, CF, PGF: see text (one formulation of the PMF uses the Iverson bracket)

In probability theory and statistics, a categorical distribution (occasionally "discrete distribution" or "multinomial distribution", both imprecise usages) is a probability distribution that describes the result of a random event that can take on one of K possible outcomes, with the probability of each outcome separately specified. There is not necessarily an underlying ordering of these outcomes, but numerical labels are attached for convenience in describing the distribution, often in the range 1 to K. Note that the K-dimensional categorical distribution is the most general distribution over a K-way event; any other discrete distribution over a size-K sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1. The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes.

Categorical distribution

56

Terminology
Occasionally, the categorical distribution is termed the "discrete distribution". However, this properly refers not to one particular family of distributions but to a general class of distributions. Note that, in some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a categorical distribution is actually meant.[1] This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-K" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below). However, conflating the categorical and multinomial distributions can lead to problems. For example, in a Dirichlet-multinomial distribution, which arises commonly in natural language processing models (although not usually with this name) as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a Hierarchical Bayesian model, it is very important to distinguish categorical from multinomial. The joint distribution of the same variables with the same Dirichlet-multinomial distribution has two different forms depending on whether it is characterized as a distribution whose domain is over individual categorical nodes or over multinomial-style counts of nodes in each particular category (similar to the distinction between a set of Bernoulli-distributed nodes and a single binomial-distributed node). Both forms have very similar-looking probability mass functions (PMF's), which both make reference to multinomial-style counts of nodes in a category. However, the multinomial-style PMF has an extra factor, a multinomial coefficient, that is not present in the categorical-style PMF. Confusing the two can easily lead to incorrect results.

Introduction
A categorical distribution is a discrete probability distribution whose sample space is the set of k individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable. In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are unimportant; they might be {0, 1, ..., k-1} or {1, 2, ..., k} or any other arbitrary set of values. In the following descriptions, we use {1, 2, ..., k} for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses {0, 1}. In this case, the probability mass function f is:

    f(x = i \mid \boldsymbol{p}) = p_i,

where \boldsymbol{p} = (p_1, \ldots, p_k), p_i represents the probability of seeing element i, and \sum_{i=1}^{k} p_i = 1.

Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket:[2]

    f(x \mid \boldsymbol{p}) = \prod_{i=1}^{k} p_i^{[x = i]},

where [x = i] evaluates to 1 if x = i, and 0 otherwise. There are various advantages of this formulation, e.g.:

It is easier to write out the likelihood function of a set of independent identically distributed categorical variables. It connects the categorical distribution with the related multinomial distribution. It shows why the Dirichlet distribution is the conjugate prior of the categorical distribution, and allows the posterior distribution of the parameters to be calculated. Yet another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter n of the multinomial distribution (the number of sampled items) is fixed at 1. In this formulation, the sample space can be considered to be the set of 1-of-K encoded[3] random vectors x of dimension k having the property that exactly one element has the value 1 and the others have the value 0. The particular element having the value 1 indicates which

category has been chosen. The probability mass function f in this formulation is:

    f(\mathbf{x} \mid \boldsymbol{p}) = \prod_{i=1}^{k} p_i^{x_i},

where p_i represents the probability of seeing element i and \sum_{i=1}^{k} p_i = 1. This is the formulation adopted by Bishop.[3][4]

. This is the formulation adopted by

Properties
The distribution is completely given by the probabilities p_i associated with each number i: p_i = \Pr(X = i), i = 1, ..., k, where \sum_{i=1}^{k} p_i = 1. The possible probability vectors are exactly the standard (k − 1)-dimensional simplex; for k = 2 this reduces to the possible probabilities of the Bernoulli distribution being the 1-simplex, p_1 + p_2 = 1. The distribution is a special case of a "multivariate Bernoulli distribution"[5] in which exactly one of the k 0-1 variables takes the value one. Let x be the realisation from a categorical distribution. Define the random vector Y as composed of the elements:

    Y_i = I(x = i),

where I is the indicator function. Then Y has a distribution which is a special case of the multinomial distribution with parameter p. The sum of n independent and identically distributed such random variables Y constructed from a categorical distribution with parameter p is multinomially distributed with parameters n and p.

The possible probabilities for the categorical distribution with k = 3 are the 2-simplex p_1 + p_2 + p_3 = 1, embedded in 3-space.

The conjugate prior distribution of a categorical distribution is a Dirichlet distribution.[1] See the section below for more discussion. The sufficient statistic from n independent observations is the set of counts (or, equivalently, proportions) of observations in each category, where the total number of trials (= n) is fixed. The indicator function of an observation having a value i, equivalent to the Iverson bracket function [x = i] or the Kronecker delta function, is Bernoulli distributed with parameter p_i.

With a conjugate prior


In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution (and also the multinomial distribution). This means that in a model consisting of a data point having a categorical distribution with unknown parameter vector p, and (in standard Bayesian style) we choose to treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we then can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

    \boldsymbol{p} \mid \boldsymbol{\alpha} \;\sim\; \mathrm{Dir}(k,\, \boldsymbol{\alpha}), \qquad x_1, \ldots, x_N \mid \boldsymbol{p} \;\sim\; \mathrm{Cat}(k,\, \boldsymbol{p}),

then the following holds:[1]

    \boldsymbol{p} \mid x_1, \ldots, x_N, \boldsymbol{\alpha} \;\sim\; \mathrm{Dir}(k,\, \boldsymbol{\alpha} + \mathbf{c}) = \mathrm{Dir}\bigl(k,\, \alpha_1 + c_1, \ldots, \alpha_k + c_k\bigr),

where \mathbf{c} = (c_1, \ldots, c_k) is the vector of observed counts, c_i being the number of observations equal to category i.
This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector \boldsymbol\alpha as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution. Further intuition comes from the expected value of the posterior distribution (see the article on the Dirichlet distribution):

\operatorname{E}[p_i \mid \mathbb{X}, \boldsymbol\alpha] = \frac{c_i + \alpha_i}{N + \sum_{k=1}^K \alpha_k}

This says that the expected probability of seeing a category i among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: if, for example, there are three possible categories, and we saw category 1 in our observed data 40% of the time, we would expect on average to see category 1 40% of the time in the posterior distribution as well.

(Note that this intuition ignores the effect of the prior distribution. Furthermore, it is important to keep in mind that the posterior is a distribution over distributions. The posterior distribution in general tells us what we know about the parameter in question, and in this case the parameter itself is a discrete probability distribution, i.e. the actual categorical distribution that generated our data. For example, if we saw the 3 categories in the ratio 40:5:55 in our observed data, then ignoring the effect of the prior distribution, we would expect the true parameter (i.e. the true, underlying distribution that generated our observed data) to have the average value of (0.40, 0.05, 0.55), which is indeed what the posterior tells us. However, the true distribution might actually be (0.35, 0.07, 0.58) or (0.42, 0.04, 0.54) or various other nearby possibilities. The amount of uncertainty involved here is specified by the variance of the posterior, which is controlled by the total number of observations: the more data we observe, the less our uncertainty about the true parameter.)

(Technically, the prior parameter \alpha_i should actually be seen as representing \alpha_i - 1 prior observations of category i. Then, the updated posterior parameter c_i + \alpha_i represents c_i + \alpha_i - 1 posterior observations. This reflects the fact that a Dirichlet distribution with \boldsymbol\alpha = (1, 1, \ldots, 1) has a completely flat shape, essentially a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the -1 term and simply think of the \boldsymbol\alpha vector as directly representing a set of pseudocounts. Furthermore, doing this avoids the issue of interpreting \alpha values less than 1.)
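The pseudocount view of the posterior update can be sketched in a few lines of Python; the function names and example data below are illustrative assumptions, and categories are assumed to be labelled 0, ..., k-1.

from collections import Counter

def dirichlet_posterior(alpha, observations, k):
    # Posterior Dirichlet parameters: add the observed counts to the prior pseudocounts.
    counts = Counter(observations)
    return [alpha[i] + counts.get(i, 0) for i in range(k)]

def posterior_mean(post_alpha):
    # Expected value of each category probability under the posterior.
    total = sum(post_alpha)
    return [a / total for a in post_alpha]

# Hypothetical example: 3 categories, a symmetric prior of 1 pseudocount each,
# and 20 observations drawn mostly from categories 0 and 2.
alpha = [1.0, 1.0, 1.0]
data = [0]*8 + [1]*1 + [2]*11
post = dirichlet_posterior(alpha, data, k=3)
print(post)                  # [9.0, 2.0, 12.0]
print(posterior_mean(post))  # proportions including the pseudocounts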


MAP Estimation
The maximum-a-posteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e.,[1]

\hat{p}_i = \frac{c_i + \alpha_i - 1}{N + \sum_{k=1}^K \alpha_k - K}

In many practical applications, the only way to guarantee the condition that c_i + \alpha_i > 1 for all i is to set \alpha_i > 1 for all i.
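Continuing the pseudocount sketch above, the MAP estimate is just the mode of the posterior Dirichlet; the helper below is an illustrative sketch and assumes every posterior parameter exceeds 1.

def posterior_mode(post_alpha):
    # Mode of a Dirichlet(post_alpha); requires every parameter to be > 1.
    k = len(post_alpha)
    total = sum(post_alpha)
    return [(a - 1) / (total - k) for a in post_alpha]

print(posterior_mode([9.0, 2.0, 12.0]))  # MAP estimate for the earlier illustrative posterior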

Marginal likelihood
In the above model, the marginal likelihood of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichlet-multinomial distribution:[1]

p(\mathbb{X} \mid \boldsymbol\alpha) = \frac{\Gamma\left(\sum_k \alpha_k\right)}{\Gamma\left(N + \sum_k \alpha_k\right)} \prod_{i=1}^K \frac{\Gamma(c_i + \alpha_i)}{\Gamma(\alpha_i)}

This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.

Posterior predictive distribution


The posterior predictive distribution of a new observation in the above model is the distribution that a new observation would take given the set of N categorical observations. As shown in the Dirichlet-multinomial distribution article, it has a very simple form:[1]

p(x_{N+1} = i \mid \mathbb{X}, \boldsymbol\alpha) = \frac{c_i + \alpha_i}{N + \sum_k \alpha_k}

Note the various relationships among this formula and the previous ones: The posterior predictive probability of seeing a particular category is the same as the relative proportion of previous observations in that category (including the pseudo-observations of the prior). This makes logical sense: intuitively, we would expect to see a particular category according to the frequency already observed of that category. The posterior predictive probability is the same as the expected value of the posterior distribution; this is explained more below. As a result, this formula can be expressed as simply "the posterior predictive probability of seeing a category is proportional to the total observed count of that category", or as "the expected count of a category is the same as the total observed count of the category", where "observed count" is taken to include the pseudo-observations of the prior. The reason for the equivalence between the posterior predictive probability and the expected value of the posterior distribution of p is evident once we re-examine the above formula. As explained in the posterior predictive distribution article, the formula for the posterior predictive probability has the form of an expected value taken with respect to the posterior distribution:


\begin{aligned} p(x_{N+1} = i \mid \mathbb{X}, \boldsymbol\alpha) &= \int_{\mathbf{p}} p(x_{N+1} = i \mid \mathbf{p}) \, p(\mathbf{p} \mid \mathbb{X}, \boldsymbol\alpha) \, d\mathbf{p} \\ &= \operatorname{E}_{\mathbf{p} \mid \mathbb{X}, \boldsymbol\alpha}\left[ p(x_{N+1} = i \mid \mathbf{p}) \right] \\ &= \operatorname{E}_{\mathbf{p} \mid \mathbb{X}, \boldsymbol\alpha}\left[ p_i \right] \\ &= \operatorname{E}[ p_i \mid \mathbb{X}, \boldsymbol\alpha ] \end{aligned}

The crucial line above is the third. The second follows directly from the definition of expected value. The third line is particular to the categorical distribution, and follows from the fact that, in the categorical distribution specifically, the expected value of seeing a particular value i is directly specified by the associated parameter p_i. The fourth line is simply a rewriting of the third in a different notation, using the notation farther up for an expectation taken with respect to the posterior distribution of the parameters. Note also what happens in a scenario in which we observe data points one by one and each time consider their predictive probability before observing the data point and updating the posterior. For any given data point, the probability of that point assuming a given category depends on the number of data points already in that category. If a category has a high frequency of occurrence, then new data points are more likely to join that category, further enriching the same category. This type of scenario is often termed a preferential attachment (or "rich get richer") model. This models many real-world processes, and in such cases the choices made by the first few data points have an outsize influence on the rest of the data points.
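The preferential-attachment behaviour of sequential predictive draws can be simulated directly; the sketch below is illustrative only (the function name and parameter values are assumptions, not part of any library).

import random

def sequential_draws(alpha, n_draws, rng=random.Random(0)):
    # Draw categories one at a time from the posterior predictive:
    # P(next = i) is proportional to alpha[i] plus the count of i seen so far.
    counts = [0] * len(alpha)
    draws = []
    for _ in range(n_draws):
        weights = [a + c for a, c in zip(alpha, counts)]
        i = rng.choices(range(len(alpha)), weights=weights)[0]
        counts[i] += 1
        draws.append(i)
    return draws, counts

# Illustrative run: categories that happen to get early draws tend to keep attracting them.
draws, counts = sequential_draws([1.0, 1.0, 1.0], 100)
print(counts)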

Posterior conditional distribution


In Gibbs sampling, we typically need to draw from conditional distributions in multi-variable Bayes networks where each variable is conditioned on all the others. In networks that include categorical variables with Dirichlet priors (e.g. mixture models and models including mixture components), the Dirichlet distributions are often "collapsed out" (marginalized out) of the network, which introduces dependencies among the various categorical nodes dependent on a given prior (specifically, their joint distribution is a Dirichlet-multinomial distribution). One of the reasons for doing this is that in such a case, the distribution of one categorical node given the others is exactly the posterior predictive distribution of the remaining nodes. That is, for a set of nodes x_1, \ldots, x_N, if we denote the node in question as x_n and the remainder as \mathbb{X}^{(-n)}, then

p(x_n = i \mid \mathbb{X}^{(-n)}, \boldsymbol\alpha) = \frac{c_i^{(-n)} + \alpha_i}{N - 1 + \sum_k \alpha_k}

where c_i^{(-n)} is the number of nodes having category i among the nodes other than node n.
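A minimal sketch of one such collapsed-Gibbs resampling step, under the assumption that categories are labelled 0, ..., k-1 (the function names are illustrative):

import random

def resample_node(alpha, counts, current_value, rng=random.Random(0)):
    # One collapsed-Gibbs step: remove node n from the counts, then draw its
    # new value with probability proportional to (alpha[i] + count of i among the other nodes).
    counts = list(counts)
    counts[current_value] -= 1
    weights = [a + c for a, c in zip(alpha, counts)]
    new_value = rng.choices(range(len(alpha)), weights=weights)[0]
    counts[new_value] += 1
    return new_value, counts

value, counts = resample_node([1.0, 1.0, 1.0], [5, 2, 3], current_value=0)
print(value, counts)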

Sampling
The most common way to sample from a categorical distribution uses a type of inverse transform sampling. Assume we are given a distribution expressed as "proportional to" some expression, with unknown normalizing constant. Then, before taking any samples, we prepare some values as follows:
1. Compute the unnormalized value of the distribution for each category.
2. Sum them up and divide each value by this sum, in order to normalize them.
3. Impose some sort of order on the categories (e.g. by an index that runs from 1 to k, where k is the number of categories).
4. Convert the values to a cumulative distribution function (CDF) by replacing each value with the sum of all of the previous values. This can be done in time O(k). The resulting value for the first category will be 0.
Then, each time it is necessary to sample a value:
1. Pick a uniformly distributed number between 0 and 1.

2. Locate the greatest number in the CDF whose value is less than or equal to the number just chosen. This can be done in time O(log(k)), by binary search.
3. Return the category corresponding to this CDF value.

If it is necessary to draw many values from the same categorical distribution, the following approach is more efficient. It draws n samples in O(n) time (assuming an O(1) approximation is used to draw values from the binomial distribution[6]).
function draw_categorical(n)   // where n is the number of samples to draw from the categorical distribution
    r = 1
    s = 0
    for i from 1 to k          // where k is the number of categories
        v = draw from a binomial(n, p[i] / r) distribution   // where p[i] is the probability of category i
        for j from 1 to v
            z[s++] = i         // where z is an array in which the results are stored
        n = n - v
        r = r - p[i]
    shuffle (randomly re-order) the elements in z
    return z
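The single-draw inverse transform method described above can also be sketched in ordinary Python; the sampler below precomputes the CDF once and uses binary search per draw (the function names and weights are illustrative assumptions).

import bisect
import itertools
import random

def make_sampler(weights, rng=random.Random(0)):
    # Build a sampler for a categorical distribution from unnormalized weights.
    total = sum(weights)
    cdf = list(itertools.accumulate(w / total for w in weights))
    def draw():
        u = rng.random()
        return bisect.bisect_left(cdf, u)   # index of the sampled category, O(log k)
    return draw

draw = make_sampler([2.0, 1.0, 7.0])        # hypothetical unnormalized weights
samples = [draw() for _ in range(10)]
print(samples)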


Notes
[1] Minka, T. (2003). Bayesian inference, entropy and the multinomial distribution (http://research.microsoft.com/en-us/um/people/minka/papers/multinomial.html). Technical report, Microsoft Research.
[2] Minka, T. (2003), op. cit. Minka uses the Kronecker delta function, similar to but less general than the Iverson bracket.
[3] Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
[4] However, Bishop does not explicitly use the term categorical distribution.
[5] Johnson, N.L., Kotz, S., Balakrishnan, N. (1997). Discrete Multivariate Distributions. Wiley. ISBN 0-471-12844-9 (p. 105).
[6] Agresti, A. (2007). An Introduction to Categorical Data Analysis. Wiley-Interscience. ISBN 978-0-471-22618-5, pp. 25.

References


Dirichlet distribution
Dirichlet distribution: parameters K ≥ 2, the number of categories (integer), and concentration parameters α_1, ..., α_K > 0; the support is the open standard (K-1)-simplex. See the text below for the probability density function, mean (see the digamma function), mode, variance and entropy.

In probability and statistics, the Dirichlet distribution (after Johann Peter Gustav Lejeune Dirichlet), often denoted Dir(\boldsymbol\alpha), is a family of continuous multivariate probability distributions parametrized by a vector \boldsymbol\alpha of positive reals. It is the multivariate generalization of the beta distribution.[1] Dirichlet distributions are very often used as prior distributions in Bayesian statistics, and in fact the Dirichlet distribution is the conjugate prior of the categorical distribution and multinomial distribution. That is, its probability density function returns the belief that the probabilities of K rival events are x_i given that each event has been observed \alpha_i - 1 times. The infinite-dimensional generalization of the Dirichlet distribution is the Dirichlet process.


Probability density function


The Dirichlet distribution of order K ≥ 2 with parameters α_1, ..., α_K > 0 has a probability density function with respect to Lebesgue measure on the Euclidean space R^{K-1} given by

f(x_1, \ldots, x_{K-1}; \alpha_1, \ldots, \alpha_K) = \frac{1}{\mathrm{B}(\boldsymbol\alpha)} \prod_{i=1}^K x_i^{\alpha_i - 1}

for all x_1, ..., x_{K-1} > 0 satisfying x_1 + ... + x_{K-1} < 1, and where x_K = 1 - x_1 - ... - x_{K-1}. The density is zero outside this open (K-1)-dimensional simplex. The normalizing constant is the multinomial Beta function, which can be expressed in terms of the gamma function:

\mathrm{B}(\boldsymbol\alpha) = \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma\left(\sum_{i=1}^K \alpha_i\right)}, \qquad \boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K).
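For a concrete reading of the density formula, a minimal pure-Python evaluation of the log-density is sketched below (the function name is illustrative, and x is assumed to lie on the open simplex):

import math

def dirichlet_logpdf(x, alpha):
    # Log-density of Dir(alpha) at a point x with sum(x) == 1 and x_i > 0.
    log_norm = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1) * math.log(xi) for a, xi in zip(alpha, x)) - log_norm

print(dirichlet_logpdf([0.2, 0.3, 0.5], [1.0, 2.0, 3.0]))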

Support
The support of the Dirichlet distribution is the set of K-dimensional vectors x whose entries are real numbers in the interval (0, 1) and whose coordinates sum to 1. These can be viewed as the probabilities of a K-way categorical event. Another way to express this is that the domain of the Dirichlet distribution is itself a probability distribution, specifically a K-dimensional discrete distribution. Note that the technical term for the set of points in the support of a K-dimensional Dirichlet distribution is the open standard (K-1)-simplex, which is a generalization of a triangle, embedded in the next-higher dimension. For example, with K = 3, the support is an equilateral triangle embedded in a downward-angle fashion in three-dimensional space, with vertices at (1,0,0), (0,1,0) and (0,0,1), i.e. touching each of the coordinate axes at a point 1 unit away from the origin.

Special cases
A very common special case is the symmetric Dirichlet distribution, where all of the elements making up the parameter vector have the same value. Symmetric Dirichlet distributions are often used when a Dirichlet prior is called for, since there typically is no prior knowledge favoring one component over another. Since all elements of the parameter vector have the same value, the distribution alternatively can be parametrized by a single scalar value α, called the concentration parameter. The density function then simplifies to

f(x_1, \ldots, x_{K-1}; \alpha) = \frac{\Gamma(\alpha K)}{\Gamma(\alpha)^K} \prod_{i=1}^K x_i^{\alpha - 1}.

When α = 1, the symmetric Dirichlet distribution is equivalent to a uniform distribution over the open standard (K-1)-simplex, i.e. it is uniform over all points in its support. Values of the concentration parameter above 1 prefer variates that are dense, evenly distributed distributions, i.e. all probabilities returned are similar to each other. Values of the concentration parameter below 1 prefer sparse distributions, i.e. most of the probabilities returned will be close to 0, and the vast majority of the mass will be concentrated in a few of the probabilities.

More generally, the parameter vector is sometimes written as the product of a (scalar) concentration parameter α and a (vector) base measure n, where n lies within the (K-1)-simplex (i.e. its coordinates sum to one). The concentration parameter in this case is larger by a factor of K than the concentration parameter for a symmetric Dirichlet distribution described above. This construction ties in with the concept of a base measure when discussing Dirichlet processes and is often used in the topic modelling literature. If we define the concentration parameter as the sum of the Dirichlet parameters for each dimension, then the Dirichlet distribution that is uniform over its support has concentration parameter K, the dimension of the distribution.


Properties
Moments
Let X = (X_1, \ldots, X_K) \sim \operatorname{Dir}(\boldsymbol\alpha), meaning that the first K-1 components have the above density and X_K = 1 - X_1 - \cdots - X_{K-1}. Define \alpha_0 = \sum_{i=1}^K \alpha_i. Then[2][3]

\operatorname{E}[X_i] = \frac{\alpha_i}{\alpha_0}, \qquad \operatorname{Var}[X_i] = \frac{\alpha_i (\alpha_0 - \alpha_i)}{\alpha_0^2 (\alpha_0 + 1)}.

Furthermore, if i \neq j,

\operatorname{Cov}[X_i, X_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0 + 1)}.

(Note that the covariance matrix so defined is singular.)

Mode
The mode of the distribution is the vector (x_1, ..., x_K) with

x_i = \frac{\alpha_i - 1}{\alpha_0 - K}, \qquad \alpha_i > 1.

Marginal distributions
The marginal distributions are beta distributions:[4]

X_i \sim \operatorname{Beta}(\alpha_i, \alpha_0 - \alpha_i).

Conjugate to categorical/multinomial
The Dirichlet distribution is the conjugate prior distribution of the categorical distribution (a generic discrete probability distribution with a given number of possible outcomes) and multinomial distribution (the distribution over observed counts of each possible category in a set of categorically distributed observations). This means that if a data point has either a categorical or multinomial distribution, and the prior distribution of the data point's parameter (the vector of probabilities that generates the data point) is distributed as a Dirichlet, then the posterior distribution of the parameter is also a Dirichlet. Intuitively, in such a case, starting from what we know about the parameter prior to observing the data point, we then can update our knowledge based on the data point and end up with a new distribution of the same form as the old one. This means that we can successively update our knowledge of a parameter by incorporating new observations one at a time, without running into mathematical difficulties. Formally, this can be expressed as follows. Given a model

\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K) \quad \text{(concentration hyperparameters)}
\mathbf{p} \mid \boldsymbol\alpha \sim \operatorname{Dir}(K, \boldsymbol\alpha)
\mathbb{X} = (x_1, \ldots, x_N) \mid \mathbf{p} \sim \operatorname{Cat}(K, \mathbf{p})

then the following holds:

\mathbf{c} = (c_1, \ldots, c_K), \quad c_i = \text{number of observations in category } i
\mathbf{p} \mid \mathbb{X}, \boldsymbol\alpha \sim \operatorname{Dir}(K, \mathbf{c} + \boldsymbol\alpha)

This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector \boldsymbol\alpha as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.

In Bayesian mixture models and other hierarchical Bayesian models with mixture components, Dirichlet distributions are commonly used as the prior distributions for the categorical variables appearing in the models. See the section on applications below for more information.

Relation to Dirichlet-multinomial distribution


In a model where a Dirichlet prior distribution is placed over a set of categorical-valued observations, the marginal joint distribution of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichlet-multinomial distribution. This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.

Entropy
If X is a Dir(\boldsymbol\alpha) random variable, then the exponential family differential identities can be used to get an analytic expression for the expectation of \ln X_i and its associated covariance matrix:

\operatorname{E}[\ln X_i] = \psi(\alpha_i) - \psi(\alpha_0)

and

\operatorname{Cov}[\ln X_i, \ln X_j] = \psi'(\alpha_i)\,\delta_{ij} - \psi'(\alpha_0),

where \psi is the digamma function, \psi' is the trigamma function, and \delta_{ij} is the Kronecker delta. The formula for \operatorname{E}[\ln X_i] yields the following formula for the information entropy of X:

H(X) = \ln \mathrm{B}(\boldsymbol\alpha) + (\alpha_0 - K)\,\psi(\alpha_0) - \sum_{i=1}^K (\alpha_i - 1)\,\psi(\alpha_i).

Aggregation
If X = (X_1, \ldots, X_K) \sim \operatorname{Dir}(\alpha_1, \ldots, \alpha_K) then, if the random variables with subscripts i and j are dropped from the vector and replaced by their sum,

(X_1, \ldots, X_i + X_j, \ldots, X_K) \sim \operatorname{Dir}(\alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K).

This aggregation property may be used to derive the marginal distribution of X_i mentioned above.

Neutrality
If X = (X_1, \ldots, X_K) \sim \operatorname{Dir}(\boldsymbol\alpha), then the vector X is said to be neutral[5] in the sense that X_1 is independent of the vector

\left(\frac{X_2}{1 - X_1}, \frac{X_3}{1 - X_1}, \ldots, \frac{X_K}{1 - X_1}\right),[6]

and similarly for removing any of X_2, \ldots, X_{K-1}. Observe that any permutation of X is also neutral (a property not possessed by samples drawn from a generalized Dirichlet distribution).[7]


Related distributions
If, for i = 1, ..., K,

Y_i \sim \operatorname{Gamma}(\text{shape} = \alpha_i, \text{scale} = 1) \text{ independently,}

then

V = \sum_{i=1}^K Y_i \sim \operatorname{Gamma}(\alpha_0, 1) \quad \text{and} \quad X = (X_1, \ldots, X_K) = (Y_1 / V, \ldots, Y_K / V) \sim \operatorname{Dir}(\alpha_1, \ldots, \alpha_K).[8]

Although the X_i are not independent from one another, they can be seen to be generated from a set of K independent gamma random variables (see [9] for proof). Unfortunately, since the sum V is lost in forming X, it is not possible to recover the original gamma random variables from these values alone. Nevertheless, because independent random variables are simpler to work with, this reparametrization can still be useful for proofs about properties of the Dirichlet distribution.

Applications
Dirichlet distributions are most commonly used as the prior distribution of categorical variables or multinomial variables in Bayesian mixture models and other hierarchical Bayesian models. (Note that in many fields, such as in natural language processing, categorical variables are often imprecisely called "multinomial variables". Such a usage is liable to cause confusion, just as if Bernoulli distributions and binomial distributions were commonly conflated.) Inference over hierarchical Bayesian models is often done using Gibbs sampling, and in such a case, instances of the Dirichlet distribution are typically marginalized out of the model by integrating out the Dirichlet random variable. This causes the various categorical variables drawn from the same Dirichlet random variable to become correlated, and the joint distribution over them assumes a Dirichlet-multinomial distribution, conditioned on the hyperparameters of the Dirichlet distribution (the concentration parameters). One of the reasons for doing this is that Gibbs sampling of the Dirichlet-multinomial distribution is extremely easy; see that article for more information.

Random number generation


Gamma distribution
A fast method to sample a random vector x = (x_1, ..., x_K) from the K-dimensional Dirichlet distribution with parameters (α_1, ..., α_K) follows immediately from this connection. First, draw K independent random samples y_1, ..., y_K from gamma distributions, each with density

\operatorname{Gamma}(\alpha_i, 1) = \frac{y_i^{\alpha_i - 1} e^{-y_i}}{\Gamma(\alpha_i)},

and then set

x_i = \frac{y_i}{\sum_{j=1}^K y_j}.

Below is example Python code to draw the sample:

import random

params = [a1, a2, ..., ak]   # the concentration parameters alpha_1, ..., alpha_K
gammas = [random.gammavariate(a, 1) for a in params]
total = sum(gammas)
sample = [y / total for y in gammas]


Marginal beta distributions


A less efficient algorithm[10] relies on the univariate marginal and conditional distributions being beta and proceeds as follows. Simulate x_1 from a Beta(\alpha_1, \sum_{i=2}^K \alpha_i) distribution. Then simulate x_2, ..., x_{K-1} in order, as follows: for j = 2, ..., K-1, simulate \phi_j from a Beta(\alpha_j, \sum_{i=j+1}^K \alpha_i) distribution, and let x_j = (1 - \sum_{i=1}^{j-1} x_i)\,\phi_j. Finally, set x_K = 1 - \sum_{i=1}^{K-1} x_i.

Below is example Python code to draw the sample:

import random

params = [a1, a2, ..., ak]   # the concentration parameters alpha_1, ..., alpha_K
xs = [random.betavariate(params[0], sum(params[1:]))]
for j in range(1, len(params) - 1):
    phi = random.betavariate(params[j], sum(params[j+1:]))
    xs.append((1 - sum(xs)) * phi)
xs.append(1 - sum(xs))

Intuitive interpretations of the parameters


The concentration parameter
Dirichlet distributions are very often used as prior distributions in Bayesian inference. The simplest and perhaps most common type of Dirichlet prior is the symmetric Dirichlet distribution, where all parameters are equal. This corresponds to the case where you have no prior information to favor one component over any other. As described above, the single value to which all parameters are set is called the concentration parameter. If the sample space of the Dirichlet distribution is interpreted as a discrete probability distribution, then intuitively the concentration parameter can be thought of as determining how "concentrated" the probability mass of a sample from a Dirichlet distribution is likely to be. With a value much less than 1, the mass will be highly concentrated in a few components, and all the rest will have almost no mass. With a value much greater than 1, the mass will be dispersed almost equally among all the components. See the article on the concentration parameter for further discussion.


String cutting
One example use of the Dirichlet distribution is if one wanted to cut strings (each of initial length 1.0) into K pieces with different lengths, where each piece had a designated average length, but allowing some variation in the relative sizes of the pieces. The α_i/α_0 values specify the mean lengths of the cut pieces of string resulting from the distribution. The variance around this mean varies inversely with α_0.

Plya's urn
Consider an urn containing balls of K different colors. Initially, the urn contains α_1 balls of color 1, α_2 balls of color 2, and so on. Now perform N draws from the urn, where after each draw, the ball is placed back into the urn with an additional ball of the same color. In the limit as N approaches infinity, the proportions of different colored balls in the urn will be distributed as Dir(α_1, ..., α_K).[11] For a formal proof, note that the proportions of the different colored balls form a bounded [0,1]^K-valued martingale, hence by the martingale convergence theorem, these proportions converge almost surely and in mean to a limiting random vector. To see that this limiting vector has the above Dirichlet distribution, check that all mixed moments agree. Note that each draw from the urn modifies the probability of drawing a ball of any one color from the urn in the future. This modification diminishes with the number of draws, since the relative effect of adding a new ball to the urn diminishes as the urn accumulates increasing numbers of balls. This "diminishing returns" effect can also help explain how small α values yield Dirichlet distributions with most of the probability mass concentrated around a single point on the simplex.
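A minimal Python sketch of the urn process (the function name and parameter values are illustrative assumptions):

import random

def polya_urn(alpha, n_draws, rng=random.Random(0)):
    # Start with alpha[i] balls of color i; after each draw, return the ball
    # plus one extra ball of the same color.
    balls = list(alpha)
    for _ in range(n_draws):
        i = rng.choices(range(len(balls)), weights=balls)[0]
        balls[i] += 1
    total = sum(balls)
    return [b / total for b in balls]   # proportions approach a single Dir(alpha) draw

print(polya_urn([1.0, 2.0, 3.0], 10000))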


References
[1] S. Kotz, N. Balakrishnan, and N. L. Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications. New York: Wiley. ISBN 0-471-18387-3. (Chapter 49: Dirichlet and Inverted Dirichlet Distributions)
[2] Eq. (49.9) on page 488 of Kotz, Balakrishnan & Johnson (2000). Continuous Multivariate Distributions. Volume 1: Models and Applications. New York: Wiley. (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471183873.html)
[3] Balakrishnan, V. B. (2005). "Chapter 27. Dirichlet Distribution". A Primer on Statistical Distributions. Hoboken, NJ: John Wiley & Sons, Inc. p. 274. ISBN 978-0-471-42798-8.
[4] Ferguson, Thomas S. (1973). "A Bayesian analysis of some nonparametric problems". The Annals of Statistics 1 (2): 209-230. doi:10.1214/aos/1176342360.
[5] Connor, Robert J.; Mosimann, James E. (1969). "Concepts of Independence for Proportions with a Generalization of the Dirichlet Distribution". Journal of the American Statistical Association 64 (325): 194-206. doi:10.2307/2283728. JSTOR 2283728.
[6] Bela A. Frigyik, Amol Kapila, and Maya R. Gupta (2010). "Introduction to the Dirichlet Distribution and Related Processes" (http://ee.washington.edu/research/guptalab/publications/UWEETR-2010-0006.pdf). Technical Report UWEETR-2010-006, University of Washington Department of Electrical Engineering. Retrieved May 2012.
[7] See Kotz, Balakrishnan & Johnson (2000), Section 8.5, "Connor and Mosimann's Generalization", pp. 519-521.
[8] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/chapter_nine.pdf). p. 402.
[9] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). p. 594. (Chapter 11.)
[10] A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin (2003). Bayesian Data Analysis (2nd ed.). p. 582. ISBN 1-58488-388-X.
[11] Blackwell, David; MacQueen, James B. (1973). "Ferguson distributions via Polya urn schemes". Ann. Stat. 1 (2): 353-355. doi:10.1214/aos/1176342372.

External links
Dirichlet Distribution (http://www.cis.hut.fi/ahonkela/dippa/node95.html) How to estimate the parameters of the Dirichlet distribution using expectation-maximization (EM) (http://www. ee.washington.edu/research/guptalab/publications/EMbookChenGupta2010.pdf) Luc Devroye. "Non-Uniform Random Variate Generation" (http://luc.devroye.org/rnbookindex.html). Retrieved May 2012. Dirichlet Random Measures, Method of Construction via Compound Poisson Random Variables, and Exchangeability Properties of the resulting Gamma Distribution (http://www.cs.princeton.edu/courses/ archive/fall07/cos597C/scribe/20071130.pdf)


Continuous Distributions on [a,b]


Uniform distribution (continuous)
In probability theory and statistics, the continuous uniform distribution or rectangular distribution is a family of probability distributions such that for each member of the family, all intervals of the same length on the distribution's support are equally probable. The support is defined by the two parameters, a and b, which are its minimum and maximum values. The distribution is often abbreviated U(a,b). It is the maximum entropy probability distribution for a random variate X under no constraint other than that it is contained in the distribution's support.[1]

Characterization
Probability density function
The probability density function of the continuous uniform distribution is:

The values of f(x) at the two boundaries a and b are usually unimportant because they do not alter the values of the integrals of f(x)dx over any interval, nor of xf(x)dx or any higher moment. Sometimes they are chosen to be zero, and sometimes chosen to be 1/(b-a). The latter is appropriate in the context of estimation by the method of maximum likelihood. In the context of Fourier analysis, one may take the value of f(a) or f(b) to be 1/(2(b-a)), since then the inverse transform of many integral transforms of this uniform function will yield back the function itself, rather than a function which is equal "almost everywhere", i.e. except on a set of points with zero measure. Also, it is consistent with the sign function which has no such ambiguity.

Cumulative distribution function


The cumulative distribution function is:

F(x) = \begin{cases} 0 & \text{for } x < a, \\ \dfrac{x-a}{b-a} & \text{for } a \le x < b, \\ 1 & \text{for } x \ge b. \end{cases}

Its inverse is:

F^{-1}(p) = a + p\,(b - a) \qquad \text{for } 0 < p < 1.

In mean and variance notation, the cumulative distribution function is:

F(x) = \begin{cases} 0 & \text{for } x - \mu < -\sigma\sqrt{3}, \\ \dfrac{1}{2}\left(\dfrac{x - \mu}{\sigma\sqrt{3}} + 1\right) & \text{for } -\sigma\sqrt{3} \le x - \mu < \sigma\sqrt{3}, \\ 1 & \text{for } x - \mu \ge \sigma\sqrt{3}, \end{cases}

and the inverse is:

F^{-1}(p) = \sigma\sqrt{3}\,(2p - 1) + \mu \qquad \text{for } 0 \le p \le 1.


Generating functions
Moment-generating function

The moment-generating function is

M_X(t) = \operatorname{E}[e^{tX}] = \frac{e^{tb} - e^{ta}}{t\,(b-a)},

from which we may calculate the raw moments m_k:

m_1 = \frac{a+b}{2}, \qquad m_2 = \frac{a^2 + ab + b^2}{3}, \qquad m_k = \frac{1}{k+1} \sum_{i=0}^{k} a^i b^{k-i}.

For a random variable following this distribution, the expected value is then m_1 = (a+b)/2 and the variance is m_2 - m_1^2 = (b-a)^2/12.

Cumulant-generating function

For n ≥ 2, the nth cumulant of the uniform distribution on the interval [0,1] is B_n/n, where B_n is the nth Bernoulli number.

Properties
Moments and parameters
The first two moments of the distribution are:

\operatorname{E}(X) = \frac{a+b}{2}, \qquad \operatorname{V}(X) = \frac{(b-a)^2}{12}.

Solving these two equations for the parameters a and b, given known moments E(X) and V(X), yields:

a = \operatorname{E}(X) - \sqrt{3 \operatorname{V}(X)}, \qquad b = \operatorname{E}(X) + \sqrt{3 \operatorname{V}(X)}.

Order statistics
Let X_1, ..., X_n be an i.i.d. sample from U(0,1). Let X_(k) be the kth order statistic from this sample. Then the probability distribution of X_(k) is a Beta distribution with parameters k and n - k + 1. The expected value is

\operatorname{E}(X_{(k)}) = \frac{k}{n+1}.

This fact is useful when making Q-Q plots. The variances are

\operatorname{V}(X_{(k)}) = \frac{k\,(n - k + 1)}{(n+1)^2 (n+2)}.


Uniformity
The probability that a uniformly distributed random variable falls within any interval of fixed length is independent of the location of the interval itself (but it is dependent on the interval size), so long as the interval is contained in the distribution's support. To see this, if X ~ U(a,b) and [x, x+d] is a subinterval of [a,b] with fixed d > 0, then

P(x \le X \le x + d) = \int_x^{x+d} \frac{dy}{b-a} = \frac{d}{b-a},

which is independent of x. This fact motivates the distribution's name.

Generalization to Borel sets


This distribution can be generalized to more complicated sets than intervals. If S is a Borel set of positive, finite measure, the uniform probability distribution on S can be specified by defining the pdf to be zero outside S and constantly equal to 1/K on S, where K is the Lebesgue measure of S.

Standard uniform
Restricting a = 0 and b = 1, the resulting distribution U(0,1) is called a standard uniform distribution. One interesting property of the standard uniform distribution is that if u1 has a standard uniform distribution, then so does 1 - u1. This property can be used for generating antithetic variates, among other things.

Related distributions
If X has a standard uniform distribution, then by the inverse transform sampling method, Y = -ln(X)/λ has an exponential distribution with (rate) parameter λ.
If X has a standard uniform distribution, then Y = X^n has a beta distribution with parameters 1/n and 1. (Note this implies that the standard uniform distribution is a special case of the beta distribution, with parameters 1 and 1.)
The Irwin-Hall distribution is the sum of n i.i.d. U(0,1) distributions.
The sum of two independent, equally distributed, uniform distributions yields a symmetric triangular distribution.
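The first of these relations is the basis of a very common sampling trick; a minimal Python sketch is below (the function name is an illustrative assumption).

import math
import random

def exponential_from_uniform(lam, rng=random.Random(0)):
    # Inverse transform sampling: if U ~ U(0,1), then -ln(U)/lam ~ Exponential(rate=lam).
    u = 1.0 - rng.random()   # lies in (0, 1], avoiding log(0)
    return -math.log(u) / lam

samples = [exponential_from_uniform(2.0) for _ in range(5)]
print(samples)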

Relationship to other functions


As long as the same conventions are followed at the transition points, the probability density function may also be expressed in terms of the Heaviside step function:

f(x) = \frac{\operatorname{H}(x-a) - \operatorname{H}(x-b)}{b-a},

or in terms of the rectangle function:

f(x) = \frac{1}{b-a}\,\operatorname{rect}\!\left(\frac{x - \frac{a+b}{2}}{b-a}\right).

There is no ambiguity at the transition point of the sign function. Using the half-maximum convention at the transition points, the uniform distribution may be expressed in terms of the sign function as:

f(x) = \frac{\operatorname{sgn}(x-a) - \operatorname{sgn}(x-b)}{2\,(b-a)}.


Applications
In statistics, when a p-value is used as a test statistic for a simple null hypothesis, and the distribution of the test statistic is continuous, then the p-value is uniformly distributed between 0 and 1 if the null hypothesis is true.

Sampling from a uniform distribution


There are many applications in which it is useful to run simulation experiments. Many programming languages have the ability to generate pseudo-random numbers which are effectively distributed according to the standard uniform distribution. If u is a value sampled from the standard uniform distribution, then the value a + (b - a)u follows the uniform distribution parametrised by a and b, as described above.

Sampling from an arbitrary distribution


The uniform distribution is useful for sampling from arbitrary distributions. A general method is the inverse transform sampling method, which uses the cumulative distribution function (CDF) of the target random variable. This method is very useful in theoretical work. Since simulations using this method require inverting the CDF of the target variable, alternative methods have been devised for the cases where the CDF is not known in closed form. One such method is rejection sampling. The normal distribution is an important example where the inverse transform method is not efficient. However, there is an exact method, the Box-Muller transformation, which uses the inverse transform to convert two independent uniform random variables into two independent normally distributed random variables.
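A minimal Python sketch of the Box-Muller transformation follows; the function name is an illustrative assumption.

import math
import random

def box_muller(rng=random.Random(0)):
    # Box-Muller transform: turns two independent U(0,1) draws into two
    # independent standard normal draws.
    u1 = 1.0 - rng.random()   # lies in (0, 1], avoiding log(0)
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)

z1, z2 = box_muller()
print(z1, z2)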

Estimation
Estimation of maximum
Given a uniform distribution on [0, N] with unknown N, the UMVU estimator for the maximum is given by

\hat{N} = \frac{k+1}{k}\,m = m + \frac{m}{k},

where m is the sample maximum and k is the sample size, sampling without replacement (though this distinction almost surely makes no difference for a continuous distribution). This follows for the same reasons as estimation for the discrete distribution, and can be seen as a very simple case of maximum spacing estimation. This problem is commonly known as the German tank problem, due to application of maximum estimation to estimates of German tank production during World War II.
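A one-function Python sketch of this estimator (the function name and data are illustrative):

def umvu_max_estimate(samples):
    # UMVU estimate of the upper bound N of U(0, N): m + m/k,
    # where m is the sample maximum and k the sample size.
    m, k = max(samples), len(samples)
    return m + m / k

print(umvu_max_estimate([2.1, 4.7, 3.3, 4.9]))  # illustrative data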

Estimation of midpoint
The midpoint of the distribution (a+b)/2 is both the mean and the median of the uniform distribution. Although both the sample mean and the sample median are unbiased estimators of the midpoint, neither is as efficient as the sample mid-range, i.e. the arithmetic mean of the sample maximum and the sample minimum, which is the UMVU estimator of the midpoint (and also the maximum likelihood estimate).

Confidence interval for the maximum


Let X_1, X_2, X_3, ..., X_n be a sample from U(0, L) where L is the population maximum. Then X_(n) = max(X_1, X_2, X_3, ..., X_n) has the density[2]

f(x) = \frac{n\,x^{n-1}}{L^n}, \qquad 0 \le x \le L.

The confidence interval for the estimated population maximum is then ( X_(n), X_(n) / α^{1/n} ), where 100(1 - α)% is the confidence level sought. In symbols,

X_{(n)} \le L \le \frac{X_{(n)}}{\alpha^{1/n}}.

References
[1] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219-230. Retrieved 2011-06-02.
[2] Nechval K.N., Nechval N.A., Vasermanis E.K., Makeev V.Y. (2002). Constructing shortest-length confidence intervals. Transport and Telecommunication 3 (1): 95-103.

External links
Online calculator of Uniform distribution (continuous) (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/ro.php?language=english)


Beta distribution
Beta distribution: shape parameters α > 0 and β > 0 (real); support 0 ≤ x ≤ 1. The density, CDF, mean (see the digamma function and the text on the geometric mean), median, mode, variance (see the trigamma function and the text on the geometric variance), skewness, excess kurtosis, entropy, moment-generating function, characteristic function (see the confluent hypergeometric function) and Fisher information matrix are given in the text below (see "Parameter estimation", "Fisher information").

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by and . The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines. Some examples follow. In population genetics it has been employed for a statistical description of the allele frequencies in the components of a sub-divided population[1] . It has been utilized in PERT[2], critical path method (CPM) and other project management / control systems to describe the statistical distributions of the time to completion and the cost of a task. It has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported as a good indicator of the condition of gears.[3] It was used to model sunshine data for application to solar renewable energy utilization.[4] It has been utilized for parametrizing variability of soil properties at the regional level for crop yield estimation, modeling crop response over the area of the association.[5] It was selected to determine well-log shale parameters, to describe the proportions of the mineralogical components existing in a certain stratigraphic interval.[6] Heterogeneity in the probability of HIV transmission in heterosexual contact has been modeled as a random variable in a beta distribution, and parameters estimated by maximum-likelihood[7] . In Bayesian inference, beta distributions provide a family of conjugate prior probability distributions for binomial and geometric distributions. For example, the beta distribution can be used in Bayesian analysis to describe initial knowledge concerning probability of success such as the probability that a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the random behavior of percentages and proportions. One theoretical case where the beta distribution arises is as the distribution of the ratio formed by one random variable having a Gamma distribution divided by the sum of it and another independent random variable also having a Gamma distribution with the same scale parameter (but possibly different shape parameter). The usual formulation of the beta distribution is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution.


Characterization
Probability density function
The probability density function of the beta distribution, for 0 ≤ x ≤ 1 and shape parameters α > 0 and β > 0, is a power function of the variable x and of its reflection (1 - x) as follows:

f(x; \alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1}(1-x)^{\beta-1} = \frac{1}{\mathrm{B}(\alpha,\beta)}\, x^{\alpha-1}(1-x)^{\beta-1},

where Γ is the gamma function. The beta function, B, appears as a normalization constant to ensure that the total probability integrates to unity. This definition includes both ends x = 0 and x = 1, which is consistent with definitions for other continuous distributions supported on a bounded interval which are special cases of the beta distribution, for example the arcsine distribution, and consistent with several authors, such as N.L. Johnson and S. Kotz.[8][9][10][11] However, several other authors, including W. Feller,[12][13][14] choose to exclude the ends x = 0 and x = 1 (such that the two ends are not actually part of the density function) and consider instead 0 < x < 1.

Several authors, including N.L. Johnson and S. Kotz,[8] use the nomenclature p instead of α and q instead of β for the shape parameters of the beta distribution, reminiscent of the nomenclature traditionally used for the parameters of the Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit as both shape parameters α and β approach the value of zero.

In the following, a random variable X that is Beta-distributed with parameters α and β will be denoted by:

X \sim \operatorname{Beta}(\alpha, \beta).

Cumulative distribution function


The cumulative distribution function is

F(x; \alpha, \beta) = \frac{\mathrm{B}(x; \alpha, \beta)}{\mathrm{B}(\alpha, \beta)} = I_x(\alpha, \beta),

where \mathrm{B}(x; \alpha, \beta) is the incomplete beta function and I_x(\alpha, \beta) is the regularized incomplete beta function.
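For numerical work, the density and distribution function above are available in SciPy; a minimal sketch follows, with arbitrary illustrative parameter values.

from scipy.stats import beta

a, b = 2.0, 5.0               # shape parameters alpha and beta
dist = beta(a, b)

print(dist.pdf(0.3))          # density f(x; alpha, beta) at x = 0.3
print(dist.cdf(0.3))          # regularized incomplete beta function I_x(alpha, beta)
print(dist.mean(), dist.var())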

Properties
Measures of central tendency
CDF for symmetric Beta distribution vs. x and alpha=beta


Mode

The mode of a Beta distributed random variable X with both parameters α and β greater than one is:[8]

\text{mode} = \frac{\alpha - 1}{\alpha + \beta - 2}.

When both parameters are less than one (α < 1 and β < 1), this is the anti-mode: the lowest point of the probability density curve.[10]

CDF for skewed Beta distribution vs. x and beta = 5 alpha

Letting α = β in the above expression one obtains 1/2, showing that for α = β the mode (in the case α, β > 1), or the anti-mode (in the case α, β < 1), is at the center of the distribution: it is symmetric in those cases. See the "Shapes" section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite, for example in the case of α = 2, β = 1 (or α = 1, β = 2), the right-triangle distribution, while in several other cases there is a singularity at the end, and hence the value of the density function approaches infinity at the end, for example in the case α = β = 1/2, the arcsine distribution. The choice whether to include, or not to include, the ends x = 0 and x = 1 as part of the density function, whether a singularity can be considered to be a mode, and whether cases with two maxima are to be considered bimodal, is responsible for some authors considering these maximum values at the end of the density distribution to be modes or not.[15][13]

Mode for Beta distribution for 1 ≤ α ≤ 5 and 1 ≤ β ≤ 5

Median

The median of the beta distribution is the unique real number x for which the regularized incomplete beta function satisfies I_x(\alpha, \beta) = 1/2. There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:

For symmetric cases α = β, the median is 1/2.
For α = 1 and β > 0, the median is 1 - 2^{-1/β} (this case is the mirror-image of the power function [0,1] distribution).
For α > 0 and β = 1, the median is 2^{-1/α} (this case is the power function [0,1] distribution[13]).
For α = 3 and β = 2, the median is approximately 0.6142, the real [0,1] solution to the quartic equation 1 - 8x^3 + 6x^4 = 0.
For α = 2 and β = 3, the median is approximately 0.3857.

Following are the limits with one parameter finite (non-zero) and the other approaching these limits:

\lim_{\beta \to 0} \text{median} = \lim_{\alpha \to \infty} \text{median} = 1, \qquad \lim_{\alpha \to 0} \text{median} = \lim_{\beta \to \infty} \text{median} = 0.
Median for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5

(Mean - Median) for Beta distribution versus alpha and beta from 0 to 2

A reasonable approximation of the value of the median of the beta distribution, for both α and β greater or equal to one, is given by the formula[16]

\text{median} \approx \frac{\alpha - \tfrac{1}{3}}{\alpha + \beta - \tfrac{2}{3}} \qquad \text{for } \alpha, \beta \ge 1.

For α ≥ 1 and β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4%, and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small.


Mean

The expected value (mean) μ of a Beta distribution random variable X with two parameters α and β is a function of only the ratio β/α of these parameters:[8]

\mu = \operatorname{E}[X] = \frac{\alpha}{\alpha + \beta} = \frac{1}{1 + \frac{\beta}{\alpha}}.

Mean for Beta distribution for 0 ≤ α ≤ 5 and 0 ≤ β ≤ 5

Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of the distribution: it is symmetric. Also, the following limits can be obtained from the above expression:

\lim_{\frac{\beta}{\alpha} \to 0} \mu = 1, \qquad \lim_{\frac{\beta}{\alpha} \to \infty} \mu = 0.

Therefore, for β/α → 0, or for α/β → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta distribution becomes a 1-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the right end, x = 1.

Similarly, for β/α → ∞, or for α/β → 0, the mean is located at the left end, x = 0. The beta distribution becomes a 1-point degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0. Following are the limits with one parameter finite (non-zero) and the other approaching these limits:

\lim_{\beta \to 0} \mu = \lim_{\alpha \to \infty} \mu = 1, \qquad \lim_{\alpha \to 0} \mu = \lim_{\beta \to \infty} \mu = 0.

Geometric mean

The logarithm of the geometric mean G_X of a distribution with random variable X is the arithmetic mean of ln X, or, equivalently, its expected value:

\ln G_X = \operatorname{E}[\ln X].

For a beta distribution, the expected value integral gives:

\operatorname{E}[\ln X] = \psi(\alpha) - \psi(\alpha + \beta),

(Mean - Geometric mean) for Beta distribution versus α and β from 0 to 2, showing the asymmetry between α and β for the geometric mean

Geometric Means for Beta distribution Purple=G(X), Yellow=G(1-X), smaller values alpha and beta in front

Geometric Means for Beta distribution Purple=G(X), Yellow=G(1-X), larger values alpha and beta in front


where ψ is the digamma function. Therefore the geometric mean of a beta distribution with shape parameters α and β is the exponential of the digamma functions of α and β as follows:

G_X = e^{\operatorname{E}[\ln X]} = e^{\psi(\alpha) - \psi(\alpha + \beta)}.

While for a beta distribution with equal shape parameters = , it follows that skewness = 0 and mode = mean = median = 1/2, the geometric mean is less than 1/2: 0 < < 1/2. The reason for this is that the logarithmic transformation strongly weights the values of X close to zero, as ln X strongly tends towards negative infinity as X approaches zero, while ln X flattens towards zero as X approaches 1. Along a line = , the following limits apply:

Following are the limits with one parameter finite (non zero) and the other approaching these limits:

The accompanying plot shows the difference between the mean and the geometric mean for shape parameters and from zero to 2. Besides the fact that the difference between them approaches zero as and approach infinity and that the difference becomes large for values of and approaching zero, one can observe an evident asymmetry of the geometric mean with respect to the shape parameters and . The difference between the geometric mean and the mean is larger for small values of in relation to than when exchanging the magnitudes of and . N.L.Johnson and S.Kotz[8] suggest the logarithmic approximation to the digamma function which results in the following approximation to the geometric mean: Numerical values for the relative error in this approximation follow: [( = = 1): 9.39%]; [( = = 2): 1.29%]; [( = 2, = 3): 1.51%]; [( = 3, = 2): 0.44%]; [( = = 3): 0.51%]; [( = = 4): 0.26%];[( = 3, = 4): 0.55%]; [( = 4, = 3): 0.24%]. Similarly, one can calculate the value of shape parameters required for the geometric mean to equal 1/2. Let's say that we know one of the parameters, , what would be the value of the other parameter, , required for the geometric mean to equal 1/2 ?. The answer is that (for > 1), the value of required tends towards + 1/2 as . For example, all these couples have the same geometric mean of 1/2: [=1,=1.4427], [=2,=2.46958], [=3,=3.47943], [=4,=4.48449], [=5,=5.48756], [=10,=10.4938], [=100,=100.499]. The fundamental property of the geometric mean, which can be proven to be false for any other mean, is

This makes the geometric mean the only correct mean when averaging normalized results, that is results that are presented as ratios to reference values.[17] This is relevant because the beta distribution is a suitable model for the random behavior of percentages and it is particularly suitable to the statistical modelling of proportions. The geometric mean plays a central role in maximum likelihood estimation, see section "Parameter estimation, maximum likelihood." Actually, when performing maximum likelihood estimation, besides the geometric mean based on

the random variable X, also another geometric mean appears naturally: the geometric mean based on the linear transformation (1 - X), the mirror-image of X, denoted by G_{(1-X)}:

G_{(1-X)} = e^{\operatorname{E}[\ln(1-X)]} = e^{\psi(\beta) - \psi(\alpha + \beta)}.

Along a line = , the following limits apply:

Following are the limits with one parameter finite (non zero) and the other approaching these limits:

It has the following approximate value: Although both and are asymmetric, in the case that both shape parameters are equal . , the

geometric means are equal: between both geometric means: Harmonic mean The inverse of the harmonic mean (

. This equality follows from the following symmetry displayed

) of a distribution with

random variable X is the arithmetic mean of 1/X, or, equivalently, its expected value. Therefore, the harmonic mean ( ) of a beta distribution with shape parameters and is: The harmonic mean ( ) of a Beta distribution with < 1 is
Harmonic mean for Beta distribution for 0<<5 and 0<<5

undefined, because its defining expression is not bounded in [0,1] for shape parameter less than unity. Letting in the above expression one obtains the harmonic mean ranges from 0, for =

, showing that for

= 1, to 1/2, for = . Following are the limits with one parameter finite (non zero) and the other approaching these limits:

(Mean - HarmonicMean) for Beta distribution versus alpha and beta from 0 to 2

Harmonic Means for Beta distribution Purple=H(X), Yellow=H(1-X), smaller values alpha and beta in front


Harmonic Means for Beta distribution Purple=H(X), Yellow=H(1-X), larger values alpha and beta in front

The harmonic mean plays a role in maximum likelihood estimation for the four parameter case, in addition to the geometric mean. Actually, when performing maximum likelihood estimation for the four parameter case, besides the harmonic mean based on the random variable X, also another harmonic mean appears naturally: the harmonic mean based on the linear transformation (1-X), the mirror-image of X, denoted by The harmonic mean ( : ) of a Beta distribution with < 1 is undefined, because its defining expression is not

bounded in [0,1] for shape parameter less than unity. Letting in the above expression one obtains , showing that for the harmonic

mean ranges from 0, for = = 1, to 1/2, for = . Following are the limits with one parameter finite (non zero) and the other approaching these limits:

Although both

and

are asymmetric, in the case that both shape parameters are equal

, the

harmonic means are equal: both harmonic means: .

. This equality follows from the following symmetry displayed between

Measures of statistical dispersion


Variance The variance (the second moment centered around the mean) of a Beta distribution random variable X with parameters and is:[8]

Letting

in the above expression one obtains increases. Setting

, showing that for

the

variance decreases monotonically as


[8]

in this expression, one finds the

maximum variance which only occurs approaching the limit, at . The beta distribution may also be parametrized in terms of its mean (0 < < 1) and sample size = + ( > 0) (see section below titled "Mean and sample size"):


Using this parametrization, one can express the variance in terms of the mean and the sample size as follows:

Since

, it must follow that

For a symmetric distribution, the mean is at the middle of the distribution, = 1/2, and therefore:

Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:

Geometric variance and covariance The logarithm of the geometric variance centered around the geometric mean of X, : of a distribution

with random variable X is the second moment of the logarithm of X

log geometric variances vs. and


log geometric variances vs. and

and therefore, the geometric variance is:

In the Fisher information matrix, and the curvature of the log likelihood function, the logarithm of the geometric variance of the reflected variable (1-X) and the logarithm of the geometric covariance between X and (1-X) appear:

For a beta distribution, higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions. See the section titled "Other moments, Moments of transformed random variables, Moments of logarithmically-transformed random variables". The variance of the logarithmic variables and covariance of lnX and ln(1-X) are:

where the trigamma function, denoted derivative of the digamma function: Therefore,

, is the second of the polygamma functions, and is defined as the .


The accompanying plots show the log geometric variances and log geometric covariance versus the shape parameters and . The plots show that the log geometric variances and log geometric covariance are close to zero for shape parameters and greater than 2, and that the log geometric variances rapidly rise in value for shape parameter values and less than unity. The log geometric variances are positive for all values of the shape parameters. The log geometric covariance is negative for all values of the shape parameters, and it reaches large negative values for and less than unity. Following are the limits with one parameter finite (non zero) and the other approaching these limits:

Limits with two parameters varying:

Although both following symmetry

and displayed

are asymmetric, in the case that both shape parameters are equal . This equality follows from the between both log geometric variances: . The log geometric covariance is symmetric:

, the log geometric variances are equal:

Mean absolute deviation around the mean The mean absolute deviation around the mean for the beta distribution with shape parameters and is [13]:

Ratio of Mean Abs.Dev. to Std.Dev. for Beta distribution with and ranging from 0 to 5


Ratio of Mean Abs.Dev. to Std.Dev. for Beta distribution with mean 01 and sample size 0<10

The mean absolute deviation around the mean is a more robust estimator of statistical dispersion than the standard deviation, as it depends on the linear (absolute) deviations rather than the square deviations from the mean. Therefore the effect of very large deviations from the mean are not as overly weighted. The term "absolute deviation" does not uniquely identify a measure of statistical dispersion, as there are several measures that can be used to measure absolute deviations, and there are several measures of central tendency that can be used as well. Thus, to uniquely identify the absolute deviation it is necessary to specify both the measure of deviation and the measure of central tendency. Unfortunately, the statistical literature has not yet adopted a standard notation, as both the mean absolute deviation around the mean and the median absolute deviation around the median have been denoted by their initials "MAD" in the literature, which may lead to confusion, since in general, they may have values considerably different from each other. Using Stirling's approximation to the Gamma function, N.L.Johnson and S.Kotz[8] derived the following approximation for values of the shape parameters greater than unity (the relative error for this approximation is only -3.5% for = = 1, and it decreases to zero as , ): At the limit , , the ratio of the mean absolute deviation to the standard deviation (for the beta distribution) becomes equal to the ratio of the same measures for the normal distribution: ratio equals . For = = 1 this

, so that from = = 1 to , the ratio decreases by 8.5%. For = = 0 the standard

deviation is exactly equal to the mean absolute deviation around the mean. Therefore this ratio decreases by 15% from = = 0 to = = 1, and by 25% from = = 0 to , . However, for skewed beta distributions such that 0 or 0, the ratio of the standard deviation to the mean absolute deviation approaches infinity (although each of them, individually, approaches zero) because the mean absolute deviation approaches zero faster than the standard deviation. Using the parametrization in terms of mean and sample size = + :

one can express the mean absolute deviation around the mean in terms of the mean and the sample size as follows:

For a symmetric distribution, the mean is at the middle of the distribution, = 1/2, and therefore:


Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:

Skewness
The skewness (the third moment centered around the mean, normalized by the 3/2 power of the variance) of the beta distribution is[8]

Skewness for Beta Distribution as a function of variance and mean

Letting

in the above expression one obtains

, showing once again that for

the distribution

is symmetric and hence the skewness is zero. Positive skew (right-tailed) for < , negative skew (left-tailed) for > . Using the parametrization in terms of mean and sample size = + :

one can express the skewness in terms of the mean and the sample size as follows:

The skewness can also be expressed just in terms of the variance var and the mean as follows:

The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is coupled with zero skewness and the symmetry condition ( = 1/2), and that maximum skewness (positive or negative infinity) occurs when the mean is located at one end or the other, so that that the "mass" of the probability distribution is concentrated at the ends (minimum variance). The following expression for the square of the skewness, in terms of the sample size = + and the variance var, is useful for the method of moments estimation of four parameters:

This expression correctly gives a skewness of zero for α = β, since in that case (see section titled "Variance"): var = 1/(4(1 + ν)). For the symmetric case (α = β), skewness = 0 over the whole range, and the following limits apply:

For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:

Kurtosis
Excess Kurtosis for Beta Distribution as a function of variance and mean

The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported to be a good indicator of the condition of a gear.[3] Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it is much more sensitive to the signal generated by human footsteps than other signals generated by vehicles, winds, noise, etc.[18] Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping[19] use the symbol

γ₂ for the excess kurtosis, but Abramowitz and Stegun[20] use different terminology. To prevent confusion[21] between kurtosis (the fourth moment centered around the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled out as follows[13][14]:

Letting α = β in the above expression one obtains

excess kurtosis = −6/(3 + 2α)    for α = β.

Therefore for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at each end x = 0 and x = 1, with nothing in between: a 2-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for further discussion). The description of kurtosis as a measure of the "peakedness" (or "heavy tails") of the probability distribution is strictly applicable to unimodal distributions (for example the normal distribution). However, for more general distributions, like the beta distribution, a more general description of kurtosis is that it is a measure of the proportion of the mass density near the mean. The higher the proportion of mass density near the mean, the higher the kurtosis, while the higher the mass density away from the mean, the lower the kurtosis. For α ≠ β, skewed beta distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0 for finite α) because all the mass density is concentrated at the mean when the mean coincides with one of the ends. Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends. Using the parametrization in terms of mean μ and sample size ν = α + β:

one can express the excess kurtosis in terms of the mean and the sample size as follows:

The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the sample size ν, as follows:

and, in terms of the variance var and the mean μ, as follows:

The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This

occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the 2 point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes at each end: it is bi-"peaky" with nothing in between them. On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end. Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of the skewness, and the sample size ν, as follows:

From this last expression, one can obtain the same limits published practically a century ago by Karl Pearson in his paper,[22] for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness"). Setting α + β = ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and excess kurtosis below the boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β = ν → ∞ determines Pearson's upper boundary.

therefore:

Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution in the plane of excess kurtosis versus squared skewness. For the symmetric case (α = β), the following limits apply:

For the unsymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:


Characteristic function
The characteristic function is the Fourier transform of the probability density function. The characteristic function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind)[8][20][23]:

φ_X(t) = E[e^(itX)] = ₁F₁(α; α + β; it)

Plots: Re(characteristic function) for the symmetric case α = β (ranging from 25 to 0 and from 0 to 25) and for the skewed case β = α + 1/2 (α ranging from 25 to 0 and from 0 to 25)

where (α)_k is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one: φ_X(0) = 1. Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of variable t:

The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in the special case α = β the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind I_(α − 1/2)) using Kummer's second transformation as follows:

In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases.
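A minimal numerical check of the Kummer-function form of the characteristic function, comparing it with a direct numerical evaluation of E[e^(itX)]. mpmath and scipy are assumed available, and the parameter values are arbitrary.

```python
import numpy as np
import mpmath
from scipy import stats
from scipy.integrate import quad

a, b, t = 2.0, 3.0, 1.5
cf_kummer = complex(mpmath.hyp1f1(a, a + b, 1j * t))   # 1F1(alpha; alpha+beta; i t)

pdf = stats.beta(a, b).pdf
re, _ = quad(lambda x: np.cos(t * x) * pdf(x), 0, 1)
im, _ = quad(lambda x: np.sin(t * x) * pdf(x), 0, 1)
print(cf_kummer, complex(re, im))   # the two values should agree
```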


Other moments
Moment generating function

It also follows[8][13] that the moment generating function is

M_X(t) = E[e^(tX)] = ₁F₁(α; α + β; t)

Higher moments

Using the moment generating function, the k-th raw moment is given by[8] the factor multiplying the (exponential series) term t^k/k! in the series of the moment generating function:

E[X^k] = (α)_k / (α + β)_k = ∏ from r = 0 to k − 1 of (α + r)/(α + β + r)

where (α)_k = α(α + 1)⋯(α + k − 1) is a Pochhammer symbol representing the rising factorial. It can also be written in a recursive form as

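A quick check of this rising-factorial product for the raw moments against a library moment computation; k and the shape parameters are arbitrary, and scipy is assumed available.

```python
from scipy import stats

def beta_raw_moment(k, a, b):
    """E[X^k] for X ~ Beta(a, b), via the rising-factorial ratio."""
    m = 1.0
    for r in range(k):
        m *= (a + r) / (a + b + r)
    return m

a, b, k = 2.0, 3.0, 4
print(beta_raw_moment(k, a, b), stats.beta.moment(k, a, b))   # the two should agree
```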
Moments of transformed random variables

Moments of linearly-transformed, product and inverted random variables

One can also show the following expectations for a transformed random variable,[8] where the random variable X is Beta-distributed with parameters α and β. The expected value of the variable (1 − X) is the mirror-symmetry of the expected value based on X:

Due to the mirror-symmetry of the probability density function of the beta distribution, the variances and covariance based on variables X and (1-X) are identical:

These are the expected values for inverted variables, (these are related to the harmonic means, see section titled "Harmonic mean"):


The following transformation by dividing the variable X by its mirror-image (X/(1 − X)) results in the expected value of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI)[8]:

Also:

Variances of these transformed variables can be obtained by integration, as the expected values of the second moments centered around the corresponding variables:

The following variance of the variable X divided by its mirror-image (X/(1 − X)) results in the variance of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI)[8]:

The covariances are:

These expectations and variances appear in the four-parameter Fisher information matrix (section titled "Fisher information," "four parameters").

Moments of logarithmically-transformed random variables

Expected values for logarithmic transformations (useful for maximum likelihood estimates, see section titled "Parameter estimation, Maximum likelihood" below) are discussed in this section. The following logarithmic linear transformations are related to the geometric means G_X and G_(1−X) (see section titled "Geometric mean"):

Plot of logit(X)=ln(X/(1-X)) (vertical axis) vs. X in the domain of 0 to 1 (horizontal axis). Logit transformations are interesting, as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable


where the digamma function ψ is defined as the logarithmic derivative of the gamma function:[20]

ψ(α) = d ln Γ(α) / dα

Logit transformations are interesting[24] , as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable:

Johnson[25] considered the distribution of the logit-transformed variable ln(X/(1 − X)), including its moment generating function and approximations for large values of the shape parameters. This transformation extends the finite support [0,1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞). Higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order poly-gamma functions as follows:

therefore the variance of the logarithmic variables and covariance of lnX and ln(1-X) are:

where the trigamma function, denoted ψ₁, is the second of the polygamma functions, and is defined as the derivative of the digamma function: ψ₁(α) = dψ(α)/dα.

The variances and covariance of the logarithmically transformed variables X and (1-X) are different, in general, because the logarithmic transformation destroys the mirror-symmetry of the original variables X and (1-X), as the logarithm approaches negative infinity for the variable approaching zero. These logarithmic variances and covariance are the elements of the Fisher information matrix for the beta distribution. They are also a measure of the curvature of the log likelihood function (see section on Maximum likelihood estimation). The variances of the log inverse variables are identical to the variances of the log variables:


It also follows that the variances of the logit transformed variables are:

Quantities of information (entropy)


Given a beta distributed random variable, X ~ Beta(α, β), the differential entropy of X is[26] (measured in nats), the expected value of the negative of the logarithm of the probability density function:

h(X) = ln B(α, β) − (α − 1) ψ(α) − (β − 1) ψ(β) + (α + β − 2) ψ(α + β)
where f(x; α, β) is the probability density function of the beta distribution:

The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers, which follows from the integral:

The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable. For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else.


The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part[27] of the same paper where he defined the discrete entropy. It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy. Given two beta distributed random variables, X₁ ~ Beta(α, β) and X₂ ~ Beta(α', β'), the cross entropy is (measured in nats)[28]:

The cross entropy has been used as an error metric to measure the distance between two hypotheses.[29][30] Its absolute value is minimum when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood[28] (see section on "Parameter estimation. Maximum likelihood estimation"). The relative entropy, or Kullback-Leibler divergence D_KL(X₁ || X₂), is a measure of the inefficiency of assuming that the distribution is X₂ ~ Beta(α', β') when the distribution is really X₁ ~ Beta(α, β). It is defined as follows (measured in nats).

The relative entropy, or Kullback-Leibler divergence, is always non-negative. A few numerical examples follow:

X₁ ~ Beta(1, 1) and X₂ ~ Beta(3, 3); h(X₁) = 0; h(X₂) = −0.267864; D_KL(X₁ || X₂) = 0.598803; D_KL(X₂ || X₁) = 0.267864
X₁ ~ Beta(3, 0.5) and X₂ ~ Beta(0.5, 3); h(X₁) = h(X₂) = −1.10805; D_KL(X₁ || X₂) = D_KL(X₂ || X₁) = 7.21574

The Kullback-Leibler divergence is not symmetric (D_KL(X₁ || X₂) ≠ D_KL(X₂ || X₁)) for the case in which the individual beta distributions Beta(1, 1) and Beta(3, 3) are symmetric, but have different entropies (h(X₁) ≠ h(X₂)). The value of the Kullback divergence depends on the direction traveled: whether going from a higher (differential) entropy to a lower (differential) entropy or the other way around. In the numerical example above, the Kullback divergence measures the inefficiency of assuming that the distribution is (bell-shaped) Beta(3, 3), rather than

(uniform) Beta(1, 1). The "h" entropy of Beta(1, 1) is higher than the "h" entropy of Beta(3, 3) because the uniform distribution Beta(1, 1) has a maximum amount of disorder. The Kullback divergence is more than two times higher (0.598803 instead of 0.267864) when measured in the direction of decreasing entropy: the direction that assumes that the (uniform) Beta(1, 1) distribution is (bell-shaped) Beta(3, 3) rather than the other way around. In this restricted sense, the Kullback divergence is consistent with the second law of thermodynamics. The Kullback-Leibler divergence is symmetric (D_KL(X₁ || X₂) = D_KL(X₂ || X₁)) for the skewed cases Beta(3, 0.5) and Beta(0.5, 3) that have equal differential entropy (h(X₁) = h(X₂)). The symmetry condition follows from the above definitions and the mirror-symmetry f(x; α, β) = f(1 − x; β, α) enjoyed by the beta distribution.
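A minimal sketch reproducing the numerical examples above. It uses the standard closed form of the Kullback-Leibler divergence between two beta densities in terms of the log-Beta and digamma functions; scipy is assumed available and the helper name is illustrative.

```python
from scipy.special import betaln, digamma as psi
from scipy.stats import beta

def kl_beta(a1, b1, a2, b2):
    """D_KL( Beta(a1, b1) || Beta(a2, b2) ), measured in nats."""
    return (betaln(a2, b2) - betaln(a1, b1)
            + (a1 - a2) * psi(a1) + (b1 - b2) * psi(b1)
            + (a2 - a1 + b2 - b1) * psi(a1 + b1))

print(beta(1, 1).entropy(), beta(3, 3).entropy())        # 0 and about -0.267864
print(kl_beta(1, 1, 3, 3), kl_beta(3, 3, 1, 1))          # about 0.598803 and 0.267864
print(kl_beta(3, 0.5, 0.5, 3), kl_beta(0.5, 3, 3, 0.5))  # both about 7.21574
```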

Relationships between statistical measures


Mean, mode and median relationship

If 1 < α < β then mode ≤ median ≤ mean.[16] Expressing the mode (only for α > 1 and β > 1), and the mean in terms of α and β:

If 1 < β < α then the order of the inequalities is reversed. For α > 1 and β > 1 the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x, for the (pathological) case of α ≅ 1 and β ≅ 1 (for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder"). For example, for α = 1.0001 and β = 1.00000001:

mode = 0.9999; PDF(mode) = 1.00010
mean = 0.500025; PDF(mean) = 1.00003
median = 0.500035; PDF(median) = 1.00003
mean − mode = −0.499875
mean − median = −9.65538 × 10⁻⁶

(where PDF stands for the value of the probability density function)
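A quick numerical illustration of the ordering mode ≤ median ≤ mean for 1 < α < β; the parameter values are arbitrary and scipy is assumed available.

```python
from scipy import stats

a, b = 2.0, 5.0
mode = (a - 1) / (a + b - 2)          # valid since a > 1 and b > 1
median = stats.beta.ppf(0.5, a, b)
mean = a / (a + b)
print(mode, median, mean)             # 0.2 <= ~0.264 <= ~0.286, so mode <= median <= mean
```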

Mean, geometric mean and harmonic mean relationship

It is known from the inequality of arithmetic and geometric means that the geometric mean is lower than the mean. Similarly, the harmonic mean is lower than the geometric mean. The accompanying plot shows that for α = β, both the mean and the median are exactly equal to 1/2, regardless of the value of α = β, and the mode is also equal to 1/2 for α = β > 1, however the geometric and harmonic means are lower than 1/2 and they only approach this value asymptotically as α = β → ∞.
Mean, median, geometric mean and harmonic mean for the beta distribution with 0 < α = β < 5


Kurtosis bounded by the square of the skewness

As remarked by Feller,[12] in the Pearson system the beta probability density appears as type I (any difference between the beta distribution and Pearson's type I distribution is only superficial and it makes no difference for the following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of his paper[22] published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed.[31] The region occupied by the beta distribution is bounded by the following two lines in the (skewness², kurtosis) plane, or the (skewness², excess kurtosis) plane:

Beta distribution α and β parameters vs. excess kurtosis and squared skewness

or, equivalently,

(At a time when there were no powerful digital computers), Karl Pearson accurately computed further boundaries,[11][22] for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line (excess kurtosis + 2 − skewness² = 0) is produced by skewed "U-shaped" beta distributions with both values of shape parameters α and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness² = 0) is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. Karl Pearson showed[22] that this upper boundary line (excess kurtosis − (3/2) skewness² = 0) is also the intersection with Pearson's distribution III, which has unlimited support in one direction (towards positive infinity), and can be bell-shaped or J-shaped. His son, Egon Pearson, showed[31] that the region (in the kurtosis/squared-skewness plane) occupied by the beta distribution (equivalently, Pearson's distribution I) as it approaches this boundary (excess kurtosis − (3/2) skewness² = 0) is shared with the noncentral chi-squared distribution. Karl Pearson[32] (Pearson 1895, pp. 357, 360, 373-376) also showed that the gamma distribution is a Pearson type III distribution. Hence this boundary line for Pearson's type III distribution is known as the gamma line. (This can be shown from the fact that the excess kurtosis of the gamma distribution is 6/k and the square of the skewness is 4/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied by the gamma distribution regardless of the value of the parameter "k"). Pearson later noted that the chi-squared distribution is a special case of Pearson's type III and also shares this boundary line (as it is apparent from the fact that for the chi-squared

distribution the excess kurtosis is 12/k and the square of the skewness is 8/k, hence (excess kurtosis − (3/2) skewness² = 0) is identically satisfied regardless of the value of the parameter "k"). This is to be expected, since the chi-squared distribution X ~ χ²(k) is a special case of the gamma distribution, with parametrization X ~ Γ(k/2, 1/2) where k is a positive integer that specifies the "number of degrees of freedom" of the chi-squared distribution. An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness² = 0) is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness²) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness² = 0) is given by α = 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness²) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value at −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (However, in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and it increases downwards rather than upwards). Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness² = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region." The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal "U"-shaped distributions for which parameters α and β approach zero and hence all the probability density is concentrated at each end: x = 0 and x = 1 with practically nothing in between them. Since for α, β → 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a 2-point distribution: the probability can only take 2 values (Bernoulli distribution), one value with probability p and the other with probability q = 1 − p. For cases approaching this limit boundary with symmetry α = β, skewness → 0, excess kurtosis → −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p → q → 1/2. For cases approaching this limit boundary with skewness, excess kurtosis → −2 + skewness², and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities p at the left end and q = 1 − p at the right end.
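A small sketch checking the two numerical boundary examples quoted above; scipy is assumed available.

```python
from scipy import stats

def skew_and_excess_kurtosis(a, b):
    s = float(stats.beta.stats(a, b, moments='s'))
    k = float(stats.beta.stats(a, b, moments='k'))   # excess kurtosis
    return s, k

s, k = skew_and_excess_kurtosis(0.1, 1000)    # near the upper (gamma-line) boundary
print(k / s**2)                               # ~1.49835, approaching 3/2 from below

s, k = skew_and_excess_kurtosis(0.0001, 0.1)  # near the lower ("impossible") boundary
print((k + 2) / s**2)                         # ~1.01621, approaching 1 from above
```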


Symmetry
All statements are conditional on α > 0 and β > 0.

Probability density function reflection symmetry

Cumulative distribution function reflection symmetry plus unitary translation

Mode reflection symmetry plus unitary translation

Median reflection symmetry plus unitary translation

Mean reflection symmetry plus unitary translation

Geometric Means each is individually asymmetric, the following symmetry applies between the geometric mean based on X and the geometric mean based on its reflection (1-X)

Harmonic Means each is individually asymmetric, the following symmetry applies between the harmonic mean based on X and the harmonic mean based on its reflection (1-X) .

Variance symmetry


Geometric variances each is individually asymmetric, the following symmetry applies between the log geometric variance based on X and the log geometric variance based on its reflection (1-X)

Geometric covariance symmetry

Mean absolute deviation around the mean symmetry

Skewness skew-symmetry

Excess kurtosis symmetry

Characteristic function symmetry of Real part (with respect to the origin of variable "t")

Characteristic function skew-symmetry of Imaginary part (with respect to the origin of variable "t")

Characteristic function symmetry of Absolute value (with respect to the origin of variable "t")

Differential Entropy symmetry

Relative Entropy (also called KullbackLeibler divergence) symmetry

Geometry of the probability distribution function


Inflection Points

For certain values of the shape parameters α and β, the probability distribution function has inflection points, at which the curvature changes sign. The position of these inflection points can be useful as a measure of the dispersion or spread of the distribution. Defining the following quantity:

Points of inflection occur,[8][10][13][14] depending on the value of the shape parameters α and β, as follows: left of the mode at:

Inflection point location versus α and β showing regions with one inflection point

The distribution is bell-shaped (symmetric for α = β and skewed otherwise), with two inflection points, equidistant from the mode as follows:


right of the mode at:

Inflection point location versus α and β showing region with two inflection points

The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode, as follows:

The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode, as follows:

The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode, as follows:

The distribution has a mode at the left end x=0 and it is positively skewed, right-tailed. There is one inflection point, located to the right of the mode, at:

The distribution is unimodal negatively skewed, left-tailed, with one inflection point, located to the left of the mode, as follows:

The distribution has a mode at the right end x=1 and it is negatively skewed, left-tailed. There is one inflection point, located to the left of the mode, at:

There are no inflection points in the remaining (symmetric and skewed) regions: U-shaped, upside-down-U-shaped, and reverse-J-shaped or J-shaped.

The accompanying plots show the inflection point locations (shown vertically, ranging from 0 to 1) versus α and β (the horizontal axes ranging from 0 to 5). There are large cuts at surfaces intersecting the lines α = 1, β = 1, α = 2, and β = 2 because at these values the beta distribution changes from 2 modes, to 1 mode, to no mode.

Shapes

The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β. The ability of the beta distribution to take this great diversity of shapes (using only two parameters) is partly responsible for finding wide application for modeling actual measurements:[8]

Plots: PDF for symmetric beta distribution vs. x and α = β from 0 to 30; PDF for symmetric beta distribution vs. x and α = β from 0 to 2; PDF for skewed beta distribution vs. x and β = 5.5, α from 0 to 9; PDF for skewed beta distribution vs. x and β = 2.5, α from 0 to 9; PDF for skewed beta distribution vs. x and β = 8, α from 0 to 10

Symmetric (α = β): the density function is symmetric about 1/2 (blue & teal plots).
α = β < 1 is U-shaped (blue plot).
α = β → 0 is a 2 point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1. The differential entropy approaches a minimum value of negative infinity; a lower value than this is impossible for any distribution to reach.
α = β = 1/2 is the arcsine distribution.
α = β = 1 is the uniform [0,1] distribution. The (negative anywhere else) differential entropy reaches its maximum value of zero.
α = β > 1 is symmetric unimodal.
α = β = 3/2 is a semi-elliptic [0,1] distribution, see: Wigner semicircle distribution.
α = β = 2 is the parabolic [0,1] distribution.
α = β > 2 is bell-shaped, with inflection points located to either side of the mode.
α = β → ∞ is a 1 point Degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the single point x = 1/2. The differential entropy approaches a minimum value of negative infinity.

Skewed (α ≠ β): the density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve.
α < 1, β < 1 is skewed U-shaped. Positive skew for α < β, negative skew for α > β.
α > 1, β > 1 is skewed unimodal (magenta & cyan plots). Positive skew for α < β, negative skew for α > β.
α < 1, β ≥ 1 is reverse J-shaped with a right tail, positively skewed, strictly decreasing, convex.
α = 1, β > 1 is positively skewed, strictly decreasing (red plot), a reversed (mirror-image) power function [0,1] distribution.
α = 1, 1 < β < 2 is concave.
α = 1, β = 2 is a straight line with slope −2, the right-triangular distribution with right angle at the left end, at x = 0.
α = 1, β > 2 is reverse J-shaped with a right tail, convex.
α ≥ 1, β < 1 is J-shaped with a left tail, negatively skewed, strictly increasing, convex.
α > 1, β = 1 is negatively skewed, strictly increasing (green plot), the power function [0,1] distribution.[13]
1 < α < 2, β = 1 is concave.
α = 2, β = 1 is a straight line with slope +2, the right-triangular distribution with right angle at the right end, at x = 1.
α > 2, β = 1 is J-shaped with a left tail, convex.
(Within the one-parameter power function family Beta(α, 1), and within its mirror image Beta(1, β), the maximum variance occurs at the golden ratio conjugate.)

Parameter estimation
Method of moments
Two unknown parameters

Two unknown parameters (α̂, β̂, of a beta distribution supported in the [0,1] interval) can be estimated, using the

method of moments, with the first two moments (sample mean and sample variance) as follows. Let:

be the sample mean estimate and

be the sample variance estimate. The method-of-moments estimates of the parameters are

α̂ = x̄ (x̄(1 − x̄)/v̄ − 1),    conditional on v̄ < x̄(1 − x̄)

β̂ = (1 − x̄)(x̄(1 − x̄)/v̄ − 1),    conditional on v̄ < x̄(1 − x̄)
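A minimal sketch of these method-of-moments estimates; numpy is assumed available and the function name is illustrative.

```python
import numpy as np

def beta_mom(x):
    """Method-of-moments estimates (alpha_hat, beta_hat) for data in (0, 1)."""
    x = np.asarray(x)
    m, v = x.mean(), x.var()          # sample mean and (biased) sample variance
    if v >= m * (1 - m):
        raise ValueError("sample variance too large for a beta fit")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common

rng = np.random.default_rng(0)
print(beta_mom(rng.beta(2.0, 5.0, size=10_000)))   # estimates should be near (2, 5)
```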

When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace x̄ with (ȳ − a)/(c − a) and v̄ with v̄_Y/(c − a)² in the above couple of equations for the shape parameters (see "Alternative parametrizations, four parameters" section below),[33] where:

Four unknown parameters

All four parameters (α̂, β̂, â, ĉ, of a beta distribution supported in the [a, c] interval, see section "Alternative parametrizations, Four parameters") can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).[34][35][8] The excess kurtosis was expressed in terms of the square of the skewness, and the sample size ν = α + β (see previous section titled "Kurtosis") as follows:
Solutions for parameter estimates vs. (sample) excess kurtosis and (sample) squared skewness

One can use this equation to solve for the sample size ν̂ = α̂ + β̂ in terms of the square of the skewness and the excess kurtosis as follows:[34]

This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson[22]) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see previous section titled "Kurtosis bounded by the square of the skewness"). The case of zero skewness can be immediately solved because for zero skewness, α = β and hence ν = 2α = 2β, therefore α̂ = β̂ = ν̂/2.

(Excess kurtosis is negative for the beta distribution with zero skewness, ranging from −2 to 0, so that ν̂ (and therefore the sample shape parameters) is positive, ranging from zero when the shape parameters approach zero and the excess kurtosis approaches −2, to infinity when the shape parameters approach infinity and the excess kurtosis approaches zero). For non-zero sample skewness one needs to solve a system of two coupled equations. Since the skewness and the excess kurtosis are independent of the parameters â, ĉ, the parameters α̂, β̂ can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known variables (sample skewness and sample excess kurtosis) and two unknowns (the shape parameters):

resulting in the following solution:[34]

Where one should take the solutions as follows: α̂ > β̂ for (negative) sample skewness < 0, and α̂ < β̂ for (positive) sample skewness > 0. The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness) and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric U-shaped for α = β < 1, uniform for α = β = 1, upside-down-U-shaped for 1 < α = β < 2 and bell-shaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by "the impossible boundary" line (excess kurtosis + 2 − skewness² = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities p at the left end and q = 1 − p at the right end. The two surfaces become further apart towards the rear edge. At this

rear edge the surface parameters are quite different from each other. As remarked, for example, by Bowman and Shenton,[36] sampling in the neighborhood of the line (sample excess kurtosis − (3/2)(sample skewness)² = 0) (the just-J-shaped portion of the rear edge where blue meets beige) "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν̂ = α̂ + β̂ becomes zero and hence ν̂ approaches infinity as that line is approached. Bowman and Shenton[36] write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However the mean and standard deviation are fairly reliable." Therefore the problem is for the case of four parameter estimation for very skewed distributions such that the excess kurtosis approaches (3/2) times the square of the skewness. This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See section titled "Kurtosis bounded by the square of the skewness" for a numerical example and further comments about this rear edge boundary line (sample excess kurtosis − (3/2)(sample skewness)² = 0). As remarked by Karl Pearson himself,[37] this issue may not be of much practical importance as this trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of shape parameters that are unlikely to occur much in practice. The usual skewed bell-shaped distributions that occur in practice do not have this parameter estimation problem.
[8][34]

can be determined using the sample mean and the sample variance using a based on the sample , the equation

. One alternative is to calculate the support interval range

variance and the sample kurtosis. For this purpose one can solve, in terms of the range and "Alternative parametrizations, four parameters"):

expressing the excess kurtosis in terms of the sample variance, and the sample size (see section titled "Kurtosis"

to obtain:

Another alternative is to calculate the support interval range (ĉ − â) based on the sample variance and the sample skewness.[34] For this purpose one can solve, in terms of the range (ĉ − â), the equation expressing the squared skewness in terms of the sample variance and the sample size ν (see sections titled "Skewness" and "Alternative parametrizations, four parameters"):
skewness in terms of the sample variance, and the sample size (see section titled "Skewness" and "Alternative parametrizations, four parameters"):

to obtain[34]:


The remaining parameter â can be determined from the sample mean and the previously obtained parameters:

and finally, of course,

In the above formulas one may take, for example, as estimates of the sample moments:

The estimators G1 for sample skewness and G2 for sample kurtosis are used by DAP/SAS, PSPP/SPSS, and Excel. However, they are not used by BMDP and (according to [38]) they were not used by MINITAB in 1998. Actually, Joanes and Gill in their 1998 study[38] concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS, PSPP/SPSS, namely G1 and G2, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that we have spelled out "sample skewness", etc., in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill[38]).

Maximum likelihood
Two unknown parameters

As it is also the case for maximum likelihood estimates for the gamma distribution, the maximum likelihood estimates for the beta distribution do not have a general closed form solution for arbitrary values of the shape parameters. If X₁, ..., X_N are independent random variables each having a beta distribution, the joint log likelihood function for N iid observations is:
Max (Joint Log Likelihood/N) for Beta distribution Maxima at alpha=beta=2


Max (Joint Log Likelihood/N) for Beta distribution Maxima at alpha=beta= {0.25,0.5,1,2,4,6,8}

Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero yielding the maximum likelihood estimator of the shape parameters:

where:

since the digamma function, denoted ψ(α), is defined as the logarithmic derivative of the gamma function:[20]

To ensure that the values with zero tangent slope are indeed a maximum (instead of a saddle-point or a minimum) one has to also satisfy the condition that the curvature is negative. This amounts to satisfying that the second partial derivative with respect to the shape parameters is negative


using the previous equations, this is equivalent to:

where the trigamma function, denoted ψ₁, is the second of the polygamma functions, and is defined as the derivative of the digamma function: ψ₁(α) = dψ(α)/dα.

These conditions are equivalent to stating that the variances of the logarithmically transformed variables are positive, since:

Therefore the condition of negative curvature at a maximum is equivalent to the statements:

Alternatively, the condition of negative curvature at a maximum is also equivalent to stating that the following logarithmic derivatives of the geometric means G_X and G_(1−X) are positive, since:

(While these slopes are indeed positive, the other slopes, with respect to the opposite parameter, are negative. The slopes of the mean and the median with respect to α and β display similar sign behavior.) From the condition that at a maximum the partial derivative with respect to the shape parameter equals zero, we obtain the following system of coupled maximum likelihood estimate equations (for the average log-likelihoods) that needs to be inverted to obtain the (unknown) shape parameter estimates α̂, β̂ in terms of the (known) average of logarithms of the samples:[8]

ψ(α̂) − ψ(α̂ + β̂) = (1/N) Σ ln Xᵢ = ln Ĝ_X

ψ(β̂) − ψ(α̂ + β̂) = (1/N) Σ ln(1 − Xᵢ) = ln Ĝ_(1−X)

where we recognize ln Ĝ_X as the logarithm of the sample geometric mean and ln Ĝ_(1−X) as the logarithm of the sample geometric mean based on (1 − X), the mirror-image of X. For α̂ = β̂, it follows that ln Ĝ_X = ln Ĝ_(1−X).

These coupled equations containing digamma functions of the shape parameter estimates α̂, β̂ must be solved by numerical methods, as done, for example, by Beckman et al.[39] Gnanadesikan et al.[40] give numerical solutions for a few cases. N.L. Johnson and S. Kotz[8] suggest that for "not too small" shape parameter estimates α̂, β̂, the logarithmic approximation to the digamma function, ψ(x) ≈ ln(x − 1/2), may be used to obtain initial values for an iterative solution, since the equations resulting from this approximation can be solved exactly:

which leads to the following solution for the initial values (of the estimate shape parameters in terms of the sample geometric means) for an iterative solution:

Alternatively, the estimates provided by the method of moments can instead be used as initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma functions. When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace ln Xᵢ in the first equation with ln((Yᵢ − a)/(c − a)), and replace ln(1 − Xᵢ) in the second equation with ln((c − Yᵢ)/(c − a))

(see "Alternative parametrizations, four parameters" section below).

If one of the shape parameters is known, the problem is considerably simplified. The following logit transformation can be used to solve for the unknown shape parameter (for skewed cases such that α̂ ≠ β̂; otherwise, if symmetric, both (equal) parameters are known when one is known):

This logit transformation is the logarithm of the transformation that divides the variable X by its mirror-image (X/(1 − X)), resulting in the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) with support [0, ∞). As previously discussed in the section "Moments of logarithmically-transformed random variables," the logit transformation ln(X/(1 − X)), studied by Johnson,[25] extends the finite support [0,1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞). If, for example, β̂ is known, the unknown parameter α̂ can be obtained in terms of the inverse[41] digamma function of the right hand side of this equation:

In particular, if one of the shape parameters has a value of unity, for example for β = 1 (the power function distribution with bounded support [0,1]), using the identity ψ(x + 1) = ψ(x) + 1/x in the equation ψ(α̂) − ψ(α̂ + β̂) = ln Ĝ_X, the maximum likelihood estimator for the unknown parameter α̂ is,[8] exactly:

α̂ = −1 / ln Ĝ_X = −N / Σ ln Xᵢ


The beta distribution has support [0,1], therefore ln Ĝ_X < 0, and hence −ln Ĝ_X > 0, and therefore α̂ > 0.
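A quick check of this closed-form estimator for the β = 1 (power function) case, under the assumption that the data really come from Beta(α, 1); numpy is assumed available.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.beta(3.0, 1.0, size=100_000)      # power-function data, beta = 1
alpha_hat = -1.0 / np.mean(np.log(x))     # equivalently -N / sum(log x_i)
print(alpha_hat)                          # should be near 3
```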

In conclusion, the maximum likelihood estimates of the shape parameters of a beta distribution are (in general) a complicated function of the sample geometric mean, and of the sample geometric mean based on (1 − X), the mirror-image of X. One may ask, if the variance (in addition to the mean) is necessary to estimate two shape parameters with the method of moments, why is the (logarithmic or geometric) variance not necessary to estimate two shape parameters with the maximum likelihood method, for which only the geometric means suffice? The answer is because the mean does not provide as much information as the geometric mean. For a beta distribution with equal shape parameters α = β, the mean is exactly 1/2, regardless of the value of the shape parameters, and therefore regardless of the value of the statistical dispersion (the variance). On the other hand, the geometric mean of a beta distribution with equal shape parameters α = β depends on the value of the shape parameters, and therefore it contains more information. Also, the geometric mean of a beta distribution does not satisfy the symmetry conditions satisfied by the mean, therefore, by employing both the geometric mean based on X and the geometric mean based on (1 − X), the maximum likelihood method is able to provide best estimates for both parameters α = β, without need of employing the variance. One can express the joint log likelihood per N iid observations in terms of the sufficient statistics (the sample geometric means) as follows:

We can plot the joint log likelihood per N observations for fixed values of the sample geometric means to see the behavior of the likelihood function as a function of the shape parameters α and β. In such a plot, the shape parameter estimators correspond to the maxima of the likelihood function. See the accompanying graph that shows that all the likelihood functions intersect at α = β = 1, which corresponds to the values of the shape parameters that give the maximum entropy (the maximum entropy occurs for shape parameters equal to unity: the uniform distribution). It is evident from the plot that the likelihood function gives sharp peaks for values of the shape parameter estimators close to zero, but that for values of the shape parameter estimators greater than one, the likelihood function becomes quite flat, with less defined peaks. Obviously, the maximum likelihood parameter estimation method for the beta distribution becomes less acceptable for larger values of the shape parameter estimators, as the uncertainty in the peak definition increases with the value of the shape parameter estimators. One can arrive at the same conclusion by noticing that the expression for the curvature of the likelihood function is in terms of the geometric variances

These variances (and therefore the curvatures) are much larger for small values of the shape parameters α and β. However, for shape parameter values α > 1, β > 1, the variances (and therefore the curvatures) flatten out. Equivalently, this result follows from the Cramér-Rao bound, since the Fisher information matrix components for the beta distribution are these logarithmic variances. The Cramér-Rao bound states that the variance of any unbiased estimator α̂ of α is bounded by the reciprocal of the Fisher information:


so the variance of the estimators increases with increasing α and β, as the logarithmic variances decrease. Also one can express the joint log likelihood per N iid observations in terms of the digamma function expressions for the logarithms of the sample geometric means as follows:

this expression is identical to the negative of the cross-entropy (see section on "Quantities of information (entropy)"). Therefore, finding the maximum of the joint log likelihood of the shape parameters, per N iid observations, is identical to finding the minimum of the cross-entropy for the beta distribution, as a function of the shape parameters.

with the cross-entropy defined as follows:

Four unknown parameters

The procedure is similar to the one followed in the two unknown parameter case. If Y₁, ..., Y_N are independent

random variables each having a beta distribution with four parameters, the joint log likelihood function for N iid observations is:

Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero yielding the maximum likelihood estimator of the shape parameters:

these equations can be re-arranged as the following system of four coupled equations (the first two equations are geometric means and the second two equations are the harmonic means) in terms of the maximum likelihood

estimates for the four parameters α̂, β̂, â, ĉ:

with sample geometric means:

The parameters â, ĉ are embedded inside the geometric mean expressions in a nonlinear way (to the power 1/N). This precludes, in general, a closed form solution, even for an initial value approximation for iteration purposes. One alternative is to use as initial values for iteration the values obtained from the method of moments solution for the four parameter case. Furthermore, the expressions for the harmonic means are well-defined only for α̂ > 1, β̂ > 1, which precludes a maximum likelihood solution for shape parameters less than unity. N.L. Johnson and S. Kotz[8] ignore the equations for the harmonic means and instead suggest "If a and c are unknown, and maximum likelihood estimators of a, c, α and β are required, the above procedure [for the two unknown parameter case, with X transformed as X = (Y − a)/(c − a)] can be repeated using a succession of trial values of a and c, until the pair (a, c) for which maximum likelihood (given a and c) is as great as possible, is attained" (where, for the purpose of clarity, their notation for the parameters has been translated into the present notation).

Fisher information matrix


Let a random variable X have a probability density f(x; θ). The partial derivative with respect to the (unknown, and to be estimated) parameter θ of the log likelihood function is called the score. The second moment of the score is called the Fisher information:

The expectation of the score is zero, therefore the Fisher information is also the second moment centered around the mean of the score: the variance of the score. If the log likelihood function is twice differentiable with respect to the parameter θ, and under certain regularity conditions,[42] then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes):

Thus, the Fisher information is the negative of the expectation of the second derivative with respect to the parameter θ of the log likelihood function. Therefore Fisher information is a measure of the curvature of the log likelihood function of θ. A low curvature (and therefore high radius of curvature), flatter log likelihood function curve has low Fisher information; while a log likelihood function curve with large curvature (and therefore low radius of curvature) has high Fisher information. When the Fisher information matrix is computed at the estimates of the parameters ("the observed Fisher information matrix") it is equivalent to the replacement of the true log likelihood surface by a

Taylor's series approximation, taken as far as the quadratic terms.[43] The word information, in the context of Fisher information, refers to information about the parameters. Information such as: estimation, sufficiency and properties of variances of estimators. The Cramér-Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any estimator of a parameter θ:

The precision to which one can estimate the estimator θ̂ of a parameter θ is limited by the Fisher information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypotheses of a parameter.[44] When there are N parameters [θ₁, ..., θ_N], then the Fisher information takes the form of an N×N positive

semidefinite symmetric matrix, the Fisher Information Matrix, with typical element:

Under certain regularity conditions[42], the Fisher Information Matrix may also be written in the following form, which is often more convenient for computation:

With X₁, ..., X_N iid random variables, an N-dimensional "box" can be constructed with sides X₁, ..., X_N. Costa and Cover[45] show that the (Shannon) differential entropy h(X) is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface of this typical set.

Two parameters

For X₁, ..., X_N independent random variables each having a beta distribution parametrized with shape parameters α and β, the joint log likelihood function for N iid observations is:

therefore the joint log likelihood function per N iid observations is:

For the two parameter case, the Fisher information has 4 components: 2 diagonal and 2 off-diagonal. Since the Fisher information matrix is symmetric, one of these off diagonal components is independent. Therefore the Fisher information matrix has 3 independent components (2 diagonal and 1 off diagonal). Aryal and Nadarajah[46] calculated Fisher's information matrix for the four parameter case, from which the two parameter case can be obtained as follows:

Since the Fisher information matrix is symmetric:

The Fisher information components are equal to the log geometric variances and log geometric covariance. Therefore they can be expressed as trigamma functions, denoted ψ₁, the second of the polygamma functions, defined as the derivative of the digamma function: ψ₁(α) = dψ(α)/dα.
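A small sketch of the two-parameter Fisher information matrix built from trigamma functions (the log geometric variances and covariance described above), together with the corresponding Cramér-Rao lower bounds per observation; scipy is assumed available and the function name is illustrative.

```python
import numpy as np
from scipy.special import polygamma

def fisher_info_beta(a, b):
    """Per-observation Fisher information matrix for Beta(a, b)."""
    tri = lambda z: polygamma(1, z)   # trigamma function
    return np.array([[tri(a) - tri(a + b), -tri(a + b)],
                     [-tri(a + b),         tri(b) - tri(a + b)]])

I = fisher_info_beta(2.0, 5.0)
print(I)
print(np.linalg.inv(I))   # Cramer-Rao lower bound (per observation) for (alpha, beta)
```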


These derivatives are also derived in the section titled "Parameter estimation", "Maximum likelihood", "Two unknown parameters," and plots of the log likelihood function are also shown in that section. The section titled "Geometric variance and covariance" contains plots and further discussion of the Fisher information matrix components: the log geometric variances and log geometric covariance as a function of the shape parameters α and β. The section titled "Other moments", "Moments of transformed random variables", "Moments of logarithmically-transformed random variables" contains formulas for moments of logarithmically-transformed random variables. Images for the Fisher information components are shown in the section titled "Geometric variance".

Four parameters

If Y₁, ..., Y_N are independent random variables each having a beta distribution with four parameters: the exponents α and β, as well as "a" (the minimum of the distribution range), and "c" (the maximum of the distribution range) (section titled "Alternative parametrizations", "Four parameters"), with probability density function:

Fisher information I(a, a) for α = β vs. range (c − a) and exponent α = β

Fisher information I(α, a) for α = β vs. range (c − a) and exponent α = β

the joint log likelihood function per N iid observations is:

For the four parameter case, the Fisher information has 4*4=16 components. It has 12 off-diagonal components = (4*4 total - 4 diagonal). Since the Fisher information matrix is symmetric, half of these components (12/2=6) are independent. Therefore the Fisher information matrix has 6 independent off-diagonal + 4 diagonal = 10 independent components. Aryal and Nadarajah[46] calculated Fisher's information matrix for the four parameter case as follows:


In the above expressions, the use of X instead of Y in the expressions for the log geometric variances and covariance is not an error. The expressions in terms of the log geometric variances and log geometric covariance occur as functions of the two-parameter X ~ Beta(α, β) parametrization because when taking the partial derivatives with respect to the exponents (α, β) in the four parameter case, one obtains the identical expressions as for the two parameter case: these terms of the four parameter Fisher information matrix are independent of the minimum "a" and maximum "c" of the distribution's range. The only non-zero term upon double differentiation of the log likelihood function with respect to the exponents α and β is the second derivative of the log of the beta function: ln B(α, β). This term is independent of the minimum "a" and maximum "c" of the distribution's range. Double differentiation of this term results in trigamma functions. The sections titled "Maximum likelihood", "Two unknown parameters" and "Four unknown parameters" also show this fact. The Fisher information for N i.i.d. samples is N times the individual Fisher information (eq. 11.279, page 394 of Cover and Thomas[28]). (Aryal and Nadarajah[46] take a single observation, N = 1, to calculate the following components of the Fisher information, which leads to the same result as considering the derivatives of the log likelihood per N observations):

(In the above expression the erroneous expression in Aryal and Nadarajah[46] has been corrected.)

The lower two diagonal entries of the Fisher information matrix, with respect to the parameter "a" (the minimum of the distribution's range) and with respect to the parameter "c" (the maximum of the distribution's range), are only defined for exponents α > 2 and β > 2 respectively. The Fisher information matrix component for the minimum "a" approaches infinity for exponent α approaching 2 from above, and the Fisher information matrix component for the maximum "c" approaches infinity for exponent β approaching 2 from above. The Fisher information matrix for the four parameter case does not depend on the individual values of the minimum "a" and the maximum "c", but only on the total range (c − a). Moreover, the components of the Fisher information matrix that depend on the range (c − a) depend only through its inverse (or the square of the inverse), such that the Fisher information decreases for increasing range (c − a). The accompanying images show the Fisher information components involving "a" and "c". Images for the Fisher information components involving only the exponents α and β are shown in the section titled "Geometric variance". All these Fisher information components look like a basin, with the "walls" of the basin being located at low values of the

Beta distribution parameters. The following four-parameter-beta-distribution Fisher information components can be expressed in terms of the two-parameter : expectations of the transformed ratio ((1-X)/X) and of its mirror image (X/(1-X)), scaled by the range (c-a), which may be helpful for interpretation:


These are also the expected values of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) [8] and its mirror image, scaled by the range (c-a). Also, the following Fisher information components can be expressed in terms of the harmonic (1/X) variances or of variances based on the ratio transformed variables ((1-X)/X) as follows:

See section "Moments of linearly-transformed, product and inverted random variables" for these expectations.

Generating beta-distributed random variates


If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then X/(X + Y) ~ Beta(α, β),[47] so one algorithm for generating beta variates is to generate X/(X + Y), where X is a gamma variate with parameters (α, 1) and Y is an independent gamma variate with parameters (β, 1). Also, the kth order statistic of n uniformly distributed variates is Beta(k, n + 1 − k), so an alternative when α and β are small integers is to generate α + β − 1 uniform variates and choose the α-th smallest (equivalently, the β-th largest).[48]
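A minimal Python sketch of these two recipes (NumPy is assumed; alpha and beta here denote the shape parameters α and β, and all names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def beta_via_gammas(alpha, beta, size):
        # X ~ Gamma(alpha, 1) and Y ~ Gamma(beta, 1) independent  =>  X/(X+Y) ~ Beta(alpha, beta)
        x = rng.gamma(alpha, 1.0, size)
        y = rng.gamma(beta, 1.0, size)
        return x / (x + y)

    def beta_via_order_statistic(alpha, beta, size):
        # For small integer alpha, beta: the alpha-th smallest of alpha+beta-1 uniforms is Beta(alpha, beta)
        u = rng.uniform(size=(size, alpha + beta - 1))
        return np.sort(u, axis=1)[:, alpha - 1]

    print(beta_via_gammas(2.0, 3.0, 100_000).mean())        # close to alpha/(alpha+beta) = 0.4
    print(beta_via_order_statistic(2, 3, 100_000).mean())   # also close to 0.4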

Related distributions
Transformations
If X ~ Beta(α, β) then 1 − X ~ Beta(β, α) (mirror-image symmetry).
If X ~ Beta(α, β) then X/(1 − X) follows the beta prime distribution, also called "beta distribution of the second kind".[49]
If X ~ Beta(n/2, m/2) then mX/(n(1 − X)) follows the Fisher-Snedecor F distribution F(n, m) (assuming n > 0 and m > 0).
A beta distribution rescaled to the interval [min, max] gives the PERT distribution used in PERT analysis, with m = most likely value.[2] Traditionally λ = 4 in PERT analysis.
If X ~ Beta(1, β) then X follows a Kumaraswamy distribution with parameters (1, β).
If X ~ Beta(α, 1) then X follows a Kumaraswamy distribution with parameters (α, 1).


Special and limiting cases


If α = 1 and β = 1 then X has the standard uniform distribution.
If α = β = 3/2 then X (rescaled to a symmetric interval) follows the Wigner semicircle distribution.
Beta(1/2, 1/2) is equivalent to the arcsine distribution. This distribution is also Jeffreys prior probability for an unknown proportion between 0 and 1. The arcsine probability density is a distribution that appears in several random walk fundamental theorems. In a fair coin toss random walk, the probability for the time of the last visit to the origin is distributed as a (U-shaped) arcsine distribution.[50][12] In a two-player fair-coin-toss game, a player is said to be in the lead if the random walk (that started at the origin) is above the origin. The most probable number of times that a given player will be in the lead, in a game of length 2N, is not N. On the contrary, N is the least likely number of times that the player will be in the lead. The most likely number of times in the lead is 0 or 2N (following the arcsine distribution).
In the limit β → ∞, the distribution of βX approaches the gamma distribution with shape parameter α (and, for α = 1, the exponential distribution).

The arcsine distribution probability density was proposed by Harold Jeffreys to represent uncertainty for a proportion between 0 and 1 in Bayesian inference, and is now commonly referred to as Jeffreys prior: p^(−1/2)(1 − p)^(−1/2). This distribution also appears in several random walk fundamental theorems.

Derived from other distributions


The kth order statistic of a sample of size n from the uniform distribution is a beta random variable, U(k) ~ Beta(k, n + 1 − k).[48]

If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then X/(X + Y) ~ Beta(α, β).
If X ~ χ²(2α) and Y ~ χ²(2β) are independent, then X/(X + Y) ~ Beta(α, β).

Example of eight random walks in one dimension starting at 0: the probability for the time of the last visit to the origin is distributed as the arcsine distribution Beta(1/2, 1/2)

If X ~ U(0, 1) then X^(1/α) ~ Beta(α, 1): the power function distribution.

Combination with other distributions


If X ~ Beta(α, β) and Y ~ F(2β, 2α) then Pr(X ≤ α/(α + βx)) = Pr(Y ≥ x) for all x > 0.

Compounding with other distributions


If p ~ Beta(α, β) and X ~ Bin(k, p) then X follows a beta-binomial distribution.
If p ~ Beta(α, β) and X ~ NB(r, p) then X follows a beta negative binomial distribution.

Generalisations
The Dirichlet distribution is a multivariate generalization of the beta distribution. Univariate marginals of the Dirichlet distribution have a beta distribution. The Pearson type I distribution is identical to the beta distribution (except for arbitrary shifting and re-scaling that can also be accomplished with the four parameter parametrization of the beta distribution). The noncentral beta distribution is a further generalization of the beta distribution.


Applications
Order statistics
The beta distribution has an important application in the theory of order statistics. A basic result is that the distribution of the kth order statistic (the kth smallest value) of a sample of size n from a continuous uniform distribution has a beta distribution.[48] This result is summarized as:

U(k) ~ Beta(k, n + 1 − k).

From this, and application of the theory related to the probability integral transform, the distribution of any individual order statistic from any continuous distribution can be derived.[48]

Rule of succession
A classic application of the beta distribution is the rule of succession, introduced in the 18th century by Pierre-Simon Laplace in the course of treating the sunrise problem. It states that, given s successes in n conditionally independent Bernoulli trials with probability p, that p should be estimated as (s + 1)/(n + 2). This estimate may be regarded as the expected value of the posterior distribution over p, namely Beta(s + 1, n − s + 1), which is given by Bayes' rule if one assumes a uniform prior probability over p (i.e., Beta(1, 1)) and then observes that p generated s successes in n trials.
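A short numerical illustration of the rule (the counts are hypothetical):

    s, n = 7, 10                    # 7 successes observed in 10 trials
    estimate = (s + 1) / (n + 2)    # Laplace's rule of succession
    print(estimate)                 # 0.666..., the mean of the Beta(8, 4) posterior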

Bayesian inference
The use of beta distributions in Bayesian inference is due to the fact that they provide a family of conjugate prior probability distributions for binomial (including Bernoulli) and geometric distributions. The domain of the beta distribution can be viewed as a probability, and in fact the beta distribution is often used to describe the distribution of a probability value p: p ~ Beta(α, β).[24] Examples of beta distributions used as prior probabilities to represent ignorance of parameter values in Bayesian inference are Beta(0,0), Beta(1/2,1/2) and Beta(1,1).

Haldane's prior probability ( Beta(0,0) )

The uniform distribution probability density was proposed by Thomas Bayes to represent ignorance of prior probabilities in Bayesian inference

As the beta distribution approaches the limit Beta(0,0), it approaches a 2-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end, at 0 and 1. A fair coin toss: one face of the coin being at 0 and the other face being at 1. The Beta(0,0) distribution was proposed by J.B.S. Haldane,[51] who suggested that the prior probability representing complete uncertainty should be proportional to p^(−1)(1 − p)^(−1). The Haldane prior probability distribution Beta(0,0) is an "improper prior" because its integration (from 0 to 1) fails to converge to 1. Zellner[52] points out that on the log-odds scale (the logit transformation ln(p/(1 − p))), the Haldane prior is a uniformly flat prior.

Jeffreys' prior probability ( Beta(1/2,1/2) )

Harold Jeffreys[53] proposed to use a prior probability measure that should be invariant under reparameterization: proportional to the square root of the determinant of Fisher's information matrix. For the Bernoulli distribution, this can be shown as follows: for a coin that is "heads" with probability p ∈ [0,1] and is "tails" with probability 1 − p, for a given (H,T) ∈ {(0,1),(1,0)} the probability is p^H (1 − p)^T. It follows that the log likelihood is H ln(p) + T ln(1 − p). The Fisher information matrix has only one component (it is a scalar, because there is only one parameter: p), therefore:

I(p) = −E[d²/dp² ln L] = E[H/p² + T/(1 − p)²] = 1/(p(1 − p)).

Thus, Jeffreys prior is proportional to p^(−1/2)(1 − p)^(−1/2): the arcsine distribution Beta(1/2, 1/2).

Bayes' prior probability ( Beta(1,1) )

The beta distribution achieves maximum differential entropy for Beta(1,1): the uniform probability density, for which all values in the domain of the distribution have equal density, which was proposed by Thomas Bayes[54] as the prior probability distribution to express ignorance about the correct prior distribution. If there is sufficient sampling data, the three priors of Bayes (Beta(1,1)), Jeffreys (Beta(1/2,1/2)) and Haldane (Beta(0,0)) should yield similar posterior probability densities. Otherwise, as Gelman et al. ([55], p. 65) point out, "if so few data are available that the choice of noninformative prior distribution makes a difference, one should put relevant information into the prior distribution." The beta distribution is the special case of the multi-dimensional Dirichlet distribution with only two parameters, and the beta distribution is conjugate to the binomial and Bernoulli distributions in exactly the same way as the Dirichlet distribution is conjugate to the multinomial distribution and categorical distribution. In Bayesian inference, the beta distribution can be derived as the posterior probability of the parameter p of a binomial distribution after observing α − 1 successes (with probability p of success) and β − 1 failures (with probability 1 − p of failure). Another way to express this is that placing a prior distribution of Beta(α, β) on the parameter p of a binomial distribution is equivalent to adding α pseudo-observations of "success" and β pseudo-observations of "failure" to the actual number of successes and failures observed, then estimating the parameter p by the proportion of successes over both real- and pseudo-observations. If α and β are greater than 0, this has the effect of smoothing out the distribution of the parameters by ensuring that some positive probability mass is assigned to all parameters even when no actual observations corresponding to those parameters are observed. Values of α and β less than 1 favor sparsity, i.e. distributions where the parameter p is close to either 0 or 1. In effect, α and β, when operating together, function as a concentration parameter; see that article for more details.
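A minimal sketch of this conjugate updating in Python (the prior and the observed counts are hypothetical):

    # Beta(a, b) prior combined with s observed successes and f observed failures
    # (a binomial likelihood) gives a Beta(a + s, b + f) posterior.
    a_prior, b_prior = 1.0, 1.0          # Bayes' uniform prior Beta(1, 1)
    s, f = 12, 3                         # observed successes and failures

    a_post, b_post = a_prior + s, b_prior + f
    posterior_mean = a_post / (a_post + b_post)              # 13/17 ~ 0.7647
    posterior_mode = (a_post - 1) / (a_post + b_post - 2)    # 12/15 = 0.8, valid since a_post, b_post > 1
    print(posterior_mean, posterior_mode)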

Subjective logic
In standard logic, propositions are considered to be either true or false. In contradistinction, subjective logic assumes that humans cannot determine with absolute certainty whether a proposition about the real world is absolutely true or false. In subjective logic the a posteriori probability estimates of binary events can be represented by beta distributions.[56]

Wavelet analysis
A wavelet is a wave-like oscillation with an amplitude that starts out at zero, increases, and then decreases back to zero. It can typically be visualized as a "brief oscillation" that promptly decays. Wavelets can be used to extract information from many different kinds of data, including but certainly not limited to audio signals and images. Thus, wavelets are purposefully crafted to have specific properties that make them useful for signal processing. Wavelets are localized in both time and frequency whereas the standard Fourier transform is only localized in frequency. Therefore, standard Fourier transforms are only applicable to stationary processes, while wavelets are applicable to non-stationary processes. Continuous wavelets can be constructed based on the beta distribution. Beta wavelets[57] can be viewed as a soft variety of Haar wavelets whose shape is fine-tuned by two shape parameters α and β.


Project Management: task cost and schedule modeling


The beta distribution can be used to model events which are constrained to take place within an interval defined by a minimum and maximum value. For this reason, the beta distribution along with the triangular distribution is used extensively in PERT, critical path method (CPM), Joint Cost Schedule Modeling (JCSM) and other project management / control systems to describe the time to completion and the cost of a task. In project management, shorthand computations are widely used to estimate the mean and standard deviation of the beta distribution[2]:

μ(Y) ≈ (a + 4b + c)/6,    σ(Y) ≈ (c − a)/6,

where a is the minimum, c is the maximum, and b is the most likely value (the mode for α > 1 and β > 1). The above estimate for the mean is known as the PERT three-point estimation and it is exact for either of the following values of α and β (for arbitrary values within these ranges):

α = β > 1 (the symmetric case), with standard deviation σ(Y) = (c − a)/(2√(2α + 1)), skewness = 0, and excess kurtosis = −6/(2α + 3);

or β = 6 − α for 5 > α > 1 (the skewed case), with standard deviation σ(Y) = (c − a)√(α(6 − α))/(6√7), skewness = (3 − α)√7/(2√(α(6 − α))), and excess kurtosis given by the general expression above evaluated at β = 6 − α.


The above estimate for the standard deviation, σ(Y) = (c − a)/6, is exact for either of the following values of α and β:

α = β = 4 (symmetric), with skewness = 0, and excess kurtosis = −6/11;
or α = 3 + √2 and β = 3 − √2 (left-tailed, negative skew), with skewness = −1/√2, and excess kurtosis = 0;
or α = 3 − √2 and β = 3 + √2 (right-tailed, positive skew), with skewness = 1/√2, and excess kurtosis = 0.

Otherwise, these can be poor approximations for beta distributions with other values of α and β. For example, particular choices of α and β have been shown to exhibit average errors of 40% in the mean and 549% in the variance.[58][59][60]
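A minimal sketch of the PERT shorthand computation (the task-duration numbers are hypothetical):

    def pert_estimates(a, b, c):
        """PERT three-point estimates from minimum a, most likely b, maximum c."""
        mean = (a + 4 * b + c) / 6
        std_dev = (c - a) / 6
        return mean, std_dev

    print(pert_estimates(a=4.0, b=7.0, c=16.0))   # (8.0, 2.0) for a task lasting between 4 and 16 days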

Alternative parametrizations
Two parameters
Mean and sample size

The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and "sample size" ν = α + β (ν > 0). This is useful in Bayesian parameter estimation if one wants to place an unbiased (uniform) prior probability over the mean. For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ X ≤ 1) is drawn from a population-level beta distribution, then an important statistic is the mean of this population-level distribution. The mean and sample size parameters are related to the shape parameters α and β via[61]

α = μν,    β = (1 − μ)ν.

Under this parametrization, one can place a uniform prior probability over the mean, and a vague prior probability (such as an exponential or gamma distribution) over the positive reals for the sample size.

Mean (allele frequency) and (Wright's) genetic distance between two populations

The Balding-Nichols model[1] is a two-parameter parametrization of the beta distribution used in population genetics. It is a statistical description of the allele frequencies in the components of a sub-divided population. See the articles Balding-Nichols model, F-statistics, fixation index and coefficient of relationship for further information.

Mean and variance

Solving the system of (coupled) equations given in the above sections as the equations for the mean and the variance of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms of the mean (μ) and the variance (var):

ν = α + β = μ(1 − μ)/var − 1,  valid when var < μ(1 − μ),
α = μν = μ(μ(1 − μ)/var − 1),
β = (1 − μ)ν = (1 − μ)(μ(1 − μ)/var − 1).
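A minimal sketch of both conversions (the function and variable names are illustrative):

    def beta_shapes_from_mean_samplesize(mu, nu):
        # alpha = mu * nu, beta = (1 - mu) * nu
        return mu * nu, (1 - mu) * nu

    def beta_shapes_from_mean_var(mu, var):
        # Method-of-moments inversion; requires 0 < var < mu * (1 - mu)
        nu = mu * (1 - mu) / var - 1
        return mu * nu, (1 - mu) * nu

    print(beta_shapes_from_mean_var(0.4, 0.04))   # (2.0, 3.0): Beta(2, 3) has mean 0.4 and variance 0.04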

This parametrization of the beta distribution may lead to a more intuitive understanding (than the one based on the original parameters α and β), for example, by expressing the mode, skewness, excess kurtosis and differential entropy in terms of the mean and the variance.


Four parameters
A beta distribution with the two shape parameters α and β is supported on the range [0,1]. It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum, c (c > a), values of the distribution,[8] by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a,c]) and the parameters a and c:

x = (y − a)/(c − a).

The probability density function of the four parameter beta distribution is equal to the two parameter distribution, scaled by the range (c − a), (so that the total area under the density curve equals a probability of one), and with the "y" variable shifted and scaled as follows:

f(y; α, β, a, c) = f(x; α, β)/(c − a) = (y − a)^(α−1) (c − y)^(β−1) / ((c − a)^(α+β−1) B(α, β)).

The measures of central location are scaled (by (c − a)) and shifted (by a), as follows:

mean(Y) = a + (c − a) α/(α + β),    mode(Y) = a + (c − a) (α − 1)/(α + β − 2)  (for α, β > 1).

The statistical dispersion measures are scaled (they do not need to be shifted because they are already centered around the mean) by the range (c − a), linearly for the mean deviation and nonlinearly for the variance:

var(Y) = (c − a)² αβ / ((α + β)²(α + β + 1)).

Since the skewness and excess kurtosis are non-dimensional quantities (as moments centered around the mean and normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given above in terms of X (with support [0,1]).
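For numerical work the four-parameter beta distribution can be handled directly as a shifted and scaled two-parameter beta; a minimal sketch assuming SciPy is available (its loc/scale mechanism implements exactly this linear transformation; the parameter values are hypothetical):

    from scipy import stats

    alpha, beta_, a, c = 2.0, 3.0, 10.0, 20.0
    Y = stats.beta(alpha, beta_, loc=a, scale=c - a)   # beta distribution supported on [a, c]

    print(Y.mean())   # a + (c - a) * alpha / (alpha + beta_) = 14.0
    print(Y.var())    # (c - a)**2 * alpha * beta_ / ((alpha + beta_)**2 * (alpha + beta_ + 1)) = 4.0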


History
The first systematic, modern discussion of the beta distribution is probably due to Karl Pearson FRS[62] (27 March 1857 - 27 April 1936[63]), an influential English mathematician who has been credited with establishing the discipline of mathematical statistics.[64] In Pearson's papers[32][22] the beta distribution is couched as a solution of a differential equation: Pearson's Type I distribution. The beta distribution is essentially identical to Pearson's Type I distribution except for arbitrary shifting and re-scaling (the beta and Pearson Type I distributions can always be equalized by proper choice of parameters). In fact, in several English books and journal articles in the few decades prior to World War II, it was common to refer to the beta distribution as Pearson's Type I distribution. William P. Elderton (1877-1962) in his 1906 monograph "Frequency curves and correlation"[34] further analyzes the beta distribution as Pearson's Type I distribution, including a full discussion of the method of moments for the four parameter case, and diagrams of (what Elderton describes as) U-shaped, J-shaped, twisted J-shaped, "cocked-hat" shapes, horizontal and angled straight-line cases. Elderton wrote "I am chiefly indebted to Professor Pearson, but the indebtedness is of a kind for which it is impossible to offer formal thanks." Elderton in his 1906 monograph[34] provides an impressive amount of information on the beta distribution, including equations for the origin of the distribution chosen to be the mode, as well as for other Pearson distributions: types I through VII. Elderton also included a number of appendixes, including one appendix ("II") on the beta and gamma functions. In later editions, Elderton added equations for the origin of the distribution chosen to be the mean, and analysis of Pearson distributions VIII through XII. As remarked by Bowman and Shenton[36] "Fisher and Pearson had a difference of opinion in the approach to (parameter) estimation, in particular relating to (Pearson's method of) moments and (Fisher's method of) maximum likelihood in the case of the Beta distribution." Also according to Bowman and Shenton, "the case of a Type I (beta distribution) model being the center of the controversy was pure serendipity. A more difficult model of 4 parameters would have been hard to find." Ronald Fisher (17 February 1890 - 29 July 1962) was one of the giants of statistics in the first half of the 20th century, and his long running public conflict with Karl Pearson can be followed in a number of articles in prestigious journals. For example, concerning the estimation of the four parameters for the beta distribution, and Fisher's criticism of Pearson's method of moments as being arbitrary, see Pearson's article "Method of moments and method of maximum likelihood"[37] (published three years after his retirement from University College, London, where his position had been divided between Fisher and Pearson's son Egon) in which Pearson writes "I read (Koshai's paper in the Journal of the Royal Statistical Society, 1933) which as far as I am aware is the only case at present published of the application of Professor Fisher's method. To my astonishment that method depends on first working out the constants of the frequency curve by the (Pearson) Method of Moments and then superposing on it, by what Fisher terms "the Method of Maximum Likelihood" a further approximation to obtain,

Karl Pearson analyzed the beta distribution as the solution "Type I" of the family of Pearson distributions

what he holds, he will thus get, "more efficient values" of the curve constants." David and Edwards's treatise on the history of statistics[65] cites the first modern treatment of the beta distribution, in 1911,[66] using the beta designation that has become standard, due to Corrado Gini (May 23, 1884 - March 13, 1965), an Italian statistician, demographer, and sociologist, who developed the Gini coefficient. N. L. Johnson and S. Kotz, in their comprehensive and very informative monograph[67] on leading historical personalities in statistical sciences credit Corrado Gini[68] as "an early Bayesian...who dealt with the problem of eliciting the parameters of an initial Beta distribution, by singling out techniques which anticipated the advent of the so called empirical Bayes approach." Bayes, in a posthumous paper[54] published in 1763, obtained a beta distribution as the density of the probability of success in Bernoulli trials (see the section titled "Applications, Bayesian inference" in this article), but the paper does not analyze any of the moments of the beta distribution or discuss any of its properties.


Citations
[1] Balding, DJ; Nichols, RA (1995). "A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity." (http:/ / www. springerlink. com/ content/ u27738g2626601p1/ ). Genetica (Springer) 96: 3-12. doi:10.1007/BF01441146. PMID7607457. . [2] Malcolm, D. G.; Roseboom, C. E., Clark, C. E. and Fazar, W., (1959). "Application of a technique for research and development program evaluation". Operations Research 7: 646649. [3] Oguamanam, D.C.D.; Martin, H.R. , Huissoon, J.P. (1995). "On the application of the beta distribution to gear damage analysis". Applied Acoustics 45 (3): 247261. doi:10.1016/0003-682X(95)00001-P. [4] Sulaiman, M.Yusof; W.M Hlaing Oo, Mahdi Abd Wahab, Azmi Zakaria (December 1999). "Application of beta distribution model to Malaysian sunshine data". Renewable Energy 18 (4): 573579. doi:10.1016/S0960-1481(99)00002-6. [5] Haskett, Jonathan D.; Yakov A. Pachepsky, Basil Acock (1995). "Use of the beta distribution for parameterizing variability of soil properties at the regional level for crop yield estimation". Agricultural Systems 48 (1): 7386. doi:10.1016/0308-521X(95)93646-U. [6] Gullco, Robert S.; Malcolm Anderson (December 2009). "Use of the Beta Distribution To Determine Well-Log Shale Parameters". SPE Reservoir Evaluation & Engineering 12 (6): 929942. doi:10.2118/106746-PA. [7] Wiley, James A.; Stephen J. Herschkorn, Nancy S. Padian (January 1989). "Heterogeneity in the probability of HIV transmission per sexual contact: The case of male-to-female transmission in penilevaginal intercourse" (http:/ / onlinelibrary. wiley. com/ doi/ 10. 1002/ sim. 4780080110/ abstract). Statistics in Medicine 8 (1): 93102,. . [8] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1995). "Chapter 21:Beta Distributions". Continuous Univariate Distributions Vol. 2 (2nd ed.). Wiley. ISBN978-0-471-58494-0. [9] Keeping, E. S. (2010). Introduction to Statistical Inference. Dover Publications. pp.462 pages. ISBN978-0486685021. [10] Wadsworth, George P. and Joseph Bryan (1960). Introduction to Probability and Random Variables. McGraw-Hill. pp.101. [11] Hahn, Gerald J. and S. Shapiro (1994). Statistical Models in Engineering (Wiley Classics Library). Wiley-Interscience. pp.376. ISBN978-0471040651. [12] Feller, William (1971). An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley. pp.669. ISBN978-0471257097. [13] Gupta (Editor), Arjun K. (2004). Handbook of Beta Distribution and Its Applications. CRC Press. pp.42. ISBN978-0824753962. [14] Panik, Michael J (2005). Advanced Statistics from an Elementary Point of View. Academic Press. pp.273. ISBN978-0120884940. [15] Rose, Colin, and Murray D. Smith (2002). Mathematical Statistics with MATHEMATICA. Springer. pp.496 pages. ISBN978-0387952345. [16] Kerman J (2011) "A closed-form approximation for the median of the beta distribution". arXiv:1111.0433v1 [17] Philip J. Fleming and John J. Wallace. How not to lie with statistics: the correct way to summarize benchmark results. Communications of the ACM, 29(3):218221, March 1986. [18] Liang, Zhiqiang; Jianming Wei, Junyu Zhao, Haitao Liu, Baoqing Li, Jie Shen and Chunlei Zheng (27 August 2008). "The Statistical Meaning of Kurtosis and Its New Application to Identification of Persons Based on Seismic Signals". Sensors 8: 51065119. doi:10.3390/s8085106. [19] Kenney, J. F., and E. S. Keeping (1951). Mathematics of Statistics Part Two, 2nd edition. D. Van Nostrand Company Inc. pp.429. 
[20] Abramowitz, Milton and Irene A. Stegun (1965). Handbook Of Mathematical Functions With Formulas, Graphs, And Mathematical Tables. Dover. pp.1046 pages. ISBN78-0486612720. [21] Weisstein., Eric W.. "Kurtosis" (http:/ / mathworld. wolfram. com/ Kurtosis. html). MathWorld--A Wolfram Web Resource. . Retrieved 13 August 2012. [22] Pearson, Karl (1916). "Mathematical contributions to the theory of evolution, XIX: Second supplement to a memoir on skew variation". Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 216 (538548): 429457. Bibcode1916RSPTA.216..429P. doi:10.1098/rsta.1916.0009. JSTOR91092. [23] Gradshteyn, I. S. , and I. M. Ryzhik (2000). Table of Integrals, Series, and Products, 6th edition. Academic Press. pp.1163. ISBN978-0122947575.

[24] MacKay, David (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press; First Edition edition. pp.640. ISBN978-0521642989. [25] Johnson, N.L. (1949). "Systems of frequency curves generated by methods of translation". Biometrika 36: 149176. [26] A. C. G. Verdugo Lazo and P. N. Rathie. "On the entropy of continuous probability distributions," IEEE Trans. Inf. Theory, IT-24:120122, 1978. [27] Shannon, Claude E., "A Mathematical Theory of Communication," Bell System Technical Journal, 27 (4):623656,1948. PDF (http:/ / www. alcatel-lucent. com/ bstj/ vol27-1948/ articles/ bstj27-4-623. pdf) [28] Cover, Thomas M. and Joy A. Thomas (2006). Elements of Information Theory 2nd Edition (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience; 2 edition. pp.776. ISBN978-0471241959. [29] Plunkett, Kim, and Jeffrey Elman (1997). Exercises in Rethinking Innateness: A Handbook for Connectionist Simulations (Neural Network Modeling and Connectionism). p. 166: A Bradford Book. pp.313. ISBN978-0262661058. [30] Nallapati, Ramesh (2006). The smoothed dirichlet distribution: understanding cross-entropy ranking in information retrieval (http:/ / maroo. cs. umass. edu/ pub/ web/ getpdf. php?id=679). Ph.D. thesis: Computer Science Dept., University of Massachusetts Amherst. pp.113. . [31] Pearson, Egon S. (July 1969). "Some historical reflections traced through the development of the use of frequency curves" (http:/ / www. smu. edu/ Dedman/ Academics/ Departments/ Statistics/ Research/ TechnicalReports). THEMIS Statistical Analysis Research Program, Technical Report 38 Office of Naval Research, Contract N000014-68-A-0515 (Project NR 042-260): 23. . [32] Pearson, Karl (1895). "Contributions to the mathematical theory of evolution, II: Skew variation in homogeneous material". Philosophical Transactions of the Royal Society of LondonARRAY 186: 343414. Bibcode1895RSPTA.186..343P. doi:10.1098/rsta.1895.0010. JSTOR90649. [33] Engineering Statistics Handbook (http:/ / www. itl. nist. gov/ div898/ handbook/ eda/ section3/ eda366h. htm) [34] Elderton, William Palin (1906). Frequency-Curves and Correlation (http:/ / ia700401. us. archive. org/ 2/ items/ frequencycurvesc00elderich/ frequencycurvesc00elderich. pdf). Charles and Edwin Layton (London). pp.172. . [35] Elderton, William Palin and Norman Lloyd Johnson (2009). Systems of Frequency Curves. Cambridge University Press. pp.228. ISBN978-0521093361. [36] Bowman, K. O.; Shenton, L. R. (2007). "The beta distribution, moment method, Karl Pearson and R.A. Fisher" (http:/ / www. csm. ornl. gov/ ~bowman/ fjts232. pdf). Far East J.Theo.Stat. 23 (2): 133-164. . [37] Pearson, Karl (June 1936). "Method of moments and method of maximum likelihood" (http:/ / www. jstor. org/ discover/ 10. 2307/ 2334123?uid=3739776& uid=2134& uid=2& uid=70& uid=4& uid=3739256& sid=21101030309093). Biometrika 28 (1/2). . [38] Joanes, D. N.; C. A. Gill (1998). "Comparing measures of sample skewness and kurtosis". The Statistician 47 (Part 1): 183-189. [39] Beckman, R. J.; G. L. Tietjen (1978). "Maximum likelihood estimation for the beta distribution". Journal of Statistical Computation and Simulation 7 (3-4): 253258. doi:10.1080/00949657808810232. [40] Gnanadesikan, R.,Pinkham and Hughes (1967). "Maximum likelihood estimation of the parameters of the beta distribution from smallest order statistics". Technometrics 9: 607620. [41] Fackler, Paul. "Inverse Digamma Function (Matlab)" (http:/ / hips. seas. harvard. 
edu/ content/ inverse-digamma-function-matlab). Harvard University School of Engineering and Applied Sciences. . Retrieved 2012-08-18. [42] Silvey, S.D. (1975). Statistical Inference. page 40: Chapman and Hal. pp.192. ISBN978-0412138201. [43] Edwards, A. W. F. (1992). Likelihood. The Johns Hopkins University Press. pp.296. ISBN978-0801844430. [44] Jaynes, E.T. (2003). Probability theory, the logic of science. Cambridge University Press. pp.758. ISBN978-0521592710. [45] Costa, Max, and Cover, Thomas (1983 (September)). On the similarity of the entropy power inequality and the Brunn Minkowski inequality (http:/ / statistics. stanford. edu/ ~ckirby/ techreports/ NSF/ COV NSF 48. pdf). Tech.Report 48, Dept. Statistics, Stanford University. pp.14. . [46] Aryal, Gokarna; Saralees Nadarajah (2004). "Information matrix for beta distributions" (http:/ / www. math. bas. bg/ serdica/ 2004/ 2004-513-526. pdf). Serdica Mathematical Journal (Bulgarian Academy of Science) 30: 513526. . [47] van der Waerden, B. L., "Mathematical Statistics", Springer, ISBN 978-3-540-04507-6. [48] David, H. A., Nagaraja, H. N. (2003) Order Statistics (3rd Edition). Wiley, New Jersey pp 458. ISBN 0-471-38926-9 [49] Herreras-Velasco, Jos Manuel and Herreras-Pleguezuelo, Rafael and Ren van Dorp, Johan. (2011). Revisiting the PERT mean and Variance. European Journal of Operational Research (210), p. 448451. [50] Feller, William (1968). An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd Edition. pp.509. ISBN978-0471257080. [51] Haldane, J.B.S. (1932). "A note on inverse probability" (http:/ / journals. cambridge. org/ action/ displayAbstract?aid=1733860). Mathematical Proceedings of the Cambridge Philosophical Society 28: 5561. . [52] Zellner, Arnold (1971). An Introduction to Bayesian Inference in Econometrics. Wiley-Interscience. pp.448. ISBN978-0471169376. [53] Jeffreys, Harold (1998). Theory of Probability. Oxford University Press, 3rd edition. pp.470. ISBN978-0198503682. [54] Bayes, Thomas; communicated by Richard Price (1763). "An Essay towards solving a Problem in the Doctrine of Chances" (http:/ / www. jstor. org/ stable/ 10. 2307/ 105741). Philosophical Transactions of the Royal Society of London 53: 370-418. . [55] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall/CRC. pp.696. ISBN978-1584883883. [56] A. Jsang. A Logic for Uncertain Probabilities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. 9(3), pp.279-311, June 2001. PDF (http:/ / www. unik. no/ people/ josang/ papers/ Jos2001-IJUFKS. pdf) [57] H.M. de Oliveira and G.A.A. Arajo,. Compactly Supported One-cyclic Wavelets Derived from Beta Distributions. Journal of Communication and Information Systems. vol.20, n.3, pp.27-33, 2005.


[58] Keefer, Donald L. and Verdini, William A. (1993). Better Estimation of PERT Activity Time Parameters. Management Science 39(9), p. 10861091. [59] Keefer, Donald L. and Bodily, Samuel E. (1983). Three-point Approximations for Continuous Random variables. Management Science 29(5), p. 595609. [60] DRMI Newsletter, Issue 12, April 8, 2005 (http:/ / www. nps. edu/ drmi/ docs/ 1apr05-newsletter. pdf) [61] Kruschke, J. (2011). Doing Bayesian data analysis: A tutorial with R and BUGS. Academic Press / Elsevier ISBN 978-0123814852 (p. 83) [62] Yule, G. U.; Filon, L. N. G. (1936). "Karl Pearson. 1857-1936". Obituary Notices of Fellows of the Royal Society 2 (5): 72. doi:10.1098/rsbm.1936.0007. JSTOR769130. [63] "Library and Archive catalogue" (http:/ / www2. royalsociety. org/ DServe/ dserve. exe?dsqIni=Dserve. ini& dsqApp=Archive& dsqCmd=Show. tcl& dsqDb=Persons& dsqPos=0& dsqSearch=((text)=' Pearson: Karl (1857 - 1936) '))). Sackler Digital Archive. Royal Society. . Retrieved 2011-07-01. [64] "Karl Pearson sesquicentenary conference" (http:/ / www. economics. soton. ac. uk/ staff/ aldrich/ KP150. htm). Royal Statistical Society. 2007-03-03. . Retrieved 2008-07-25. [65] David, H. A. and A.W.F. Edwards (2001). Annotated Readings in the History of Statistics. Springer; 1 edition. pp.252. ISBN978-0387988443. [66] Gini, Corrado (1911). Studi Economico-Giuridici della Universit de Cagliari Anno III (reproduced in Metron 15, 133,171, 1949): 541. [67] Johnson (Editor), Norman L. and Samuel Kotz (1997). Leading Personalities in Statistical Sciences: From the Seventeenth Century to the Present (Wiley Series in Probability and Statistics. Wiley. pp.432. ISBN978-0471163817. [68] Metron journal.. "Biography of Corrado Gini" (http:/ / www. metronjournal. it/ storia/ ginibio. htm). Metron Journal. . Retrieved 2012-08-18.


External links
Weisstein, Eric W., " Beta Distribution (http://mathworld.wolfram.com/BetaDistribution.html)" from MathWorld. "Beta Distribution" (http://demonstrations.wolfram.com/BetaDistribution/) by Fiona Maclachlan, the Wolfram Demonstrations Project, 2007. Beta Distribution Overview and Example (http://www.xycoon.com/beta.htm), xycoon.com Beta Distribution (http://www.brighton-webs.co.uk/distributions/beta.htm), brighton-webs.co.uk


Continuous Distributions on (-Inf, Inf)


Normal distribution
Probability density function
The red curve is the standard normal distribution.
Cumulative distribution function

Notation: N(μ, σ²)
Parameters: μ ∈ R (mean, location); σ² > 0 (variance, squared scale)
Support: x ∈ R
PDF: (1/(σ√(2π))) e^(−(x − μ)²/(2σ²))
CDF: (1/2)[1 + erf((x − μ)/(σ√2))]
Mean: μ    Median: μ    Mode: μ
Variance: σ²    Skewness: 0    Excess kurtosis: 0
Entropy: (1/2) ln(2πeσ²)
MGF: exp(μt + σ²t²/2)    CF: exp(iμt − σ²t²/2)
Fisher information: diag(1/σ², 1/(2σ⁴))

In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution that has a bell-shaped probability density function, known as the Gaussian function or informally as the bell curve:[1]

f(x; μ, σ²) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²)).
The parameter μ is the mean or expectation (location of the peak) and σ² is the variance. σ is known as the standard deviation. The distribution with μ = 0 and σ² = 1 is called the standard normal distribution or the unit normal distribution. A normal distribution is often used as a first approximation to describe real-valued random variables that cluster around a single mean value. The normal distribution is considered the most prominent probability distribution in statistics. There are several reasons for this:[2] First, the normal distribution arises from the central limit theorem, which states that under mild conditions, the mean of a large number of random variables drawn from the same distribution is distributed approximately normally, irrespective of the form of the original distribution. This gives it exceptionally wide application in, for example, sampling. Secondly, the normal distribution is very tractable analytically, that is, a large number of results involving this distribution can be derived in explicit form. For these reasons, the normal distribution is commonly encountered in practice, and is used throughout statistics, natural sciences, and social sciences[3] as a simple model for complex phenomena. For example, the observational error in an experiment is usually assumed to follow a normal distribution, and the propagation of uncertainty is computed using this assumption. Note that a normally distributed variable has a symmetric distribution about its mean. Quantities that grow exponentially, such as prices, incomes or populations, are often skewed to the right, and hence may be better described by other distributions, such as the log-normal distribution or Pareto distribution. In addition, the probability of seeing a normally distributed value that is far (i.e. more than a few standard deviations) from the mean drops off extremely rapidly. As a result, statistical inference using a normal distribution is not robust to the presence of outliers (data that are unexpectedly far from the mean, due to exceptional circumstances, observational error, etc.). When outliers are expected, data may be better described using a heavy-tailed distribution such as the Student's t-distribution. From a technical perspective, alternative characterizations are possible, for example: The normal distribution is the only absolutely continuous distribution all of whose cumulants beyond the first two (i.e. other than the mean and variance) are zero. For a given mean and variance, the corresponding normal distribution is the continuous distribution with the maximum entropy.[4][5] The normal distributions are a subclass of the elliptical distributions.

Definition
The simplest case of a normal distribution is known as the standard normal distribution, described by the probability density function

The factor in this expression ensures that the total area under the curve (x) is equal to one[proof], and 12 in the exponent makes the "width" of the curve (measured as half the distance between the inflection points) also equal to one. It is traditional in statistics to denote this function with the Greek letter (phi), whereas density functions for all other distributions are usually denoted with letters f orp.[6] The alternative glyph is also used quite often, however within this article "" is reserved to denote characteristic functions. Every normal distribution is the result of exponentiating a quadratic function (just as an exponential distribution results from exponentiating a linear function):

This yields the classic "bell curve" shape, provided that a < 0 so that the quadratic function is concave for x close to 0. f(x) > 0 everywhere. One can adjust a to control the "width" of the bell, then adjust b to move the central peak of the bell along the x-axis, and finally one must choose c such that the total area under the curve is equal to one (which is only possible when a < 0). Rather than using a, b, and c, it is far more common to describe a normal distribution by its mean μ = −b/(2a) and variance σ² = −1/(2a). Changing to these new parameters allows one to rewrite the probability density function in a convenient standard form,

f(x) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²)) = (1/σ) φ((x − μ)/σ).

For a standard normal distribution, μ = 0 and σ² = 1. The last part of the equation above shows that any other normal distribution can be regarded as a version of the standard normal distribution that has been stretched horizontally by a factor σ and then translated rightward by a distance μ. Thus, μ specifies the position of the bell curve's central peak, and σ specifies the "width" of the bell curve. The parameter μ is at the same time the mean, the median and the mode of the normal distribution. The parameter σ² is called the variance; as for any random variable, it describes how concentrated the distribution is around its mean. The square root of σ² is called the standard deviation and is the width of the density function. The normal distribution is usually denoted by N(μ, σ²).[7] Thus when a random variable X is distributed normally with mean μ and variance σ², we write

X ~ N(μ, σ²).

Alternative formulations
Some authors advocate using the precision τ instead of the variance. The precision is normally defined as the reciprocal of the variance (τ = 1/σ²), although it is occasionally defined as the reciprocal of the standard deviation (τ = 1/σ).[8] This parametrization has an advantage in numerical applications where σ² is very close to zero and is more convenient to work with in analysis as τ is a natural parameter of the normal distribution. This parametrization is common in Bayesian statistics, as it simplifies the Bayesian analysis of the normal distribution. Another advantage of using this parametrization is in the study of conditional distributions in the multivariate normal case. The form of the normal distribution with the more common definition τ = 1/σ² is as follows:

f(x; μ, τ) = √(τ/(2π)) e^(−τ(x − μ)²/2).

The question of which normal distribution should be called the "standard" one is also answered differently by various authors. Starting from the works of Gauss the standard normal was considered to be the one with variance σ² = 1/2:

f(x) = (1/√π) e^(−x²).

Stigler (1982) goes even further and insists the standard normal to be the one with variance σ² = 1/(2π):

f(x) = e^(−πx²).

According to the author, this formulation is advantageous because of a much simpler and easier-to-remember formula, the fact that the pdf has unit height at zero, and simple approximate formulas for the quantiles of the distribution.


Characterization
In the previous section the normal distribution was defined by specifying its probability density function. However there are other ways to characterize a probability distribution. They include: the cumulative distribution function, the moments, the cumulants, the characteristic function, the moment-generating function, etc.

Probability density function


The probability density function (pdf) of a random variable describes the relative frequencies of different values for that random variable. The pdf of the normal distribution is given by the formula explained in detail in the previous section:

f(x; μ, σ²) = (1/(σ√(2π))) e^(−(x − μ)²/(2σ²)).

This is a proper function only when the variance σ² is not equal to zero. In that case this is a continuous smooth function, defined on the entire real line, and which is called the "Gaussian function". Properties: Function f(x) is unimodal and symmetric around the point x = μ, which is at the same time the mode, the median and the mean of the distribution.[9] The inflection points of the curve occur one standard deviation away from the mean (i.e., at x = μ − σ and x = μ + σ).[9] Function f(x) is log-concave.[9] The standard normal density φ(x) is an eigenfunction of the Fourier transform in that if f is a normalized Gaussian function with variance σ², centered at zero, then its Fourier transform is a Gaussian function with variance 1/σ². The function is supersmooth of order 2, implying that it is infinitely differentiable.[10] The first derivative of φ(x) is φ′(x) = −xφ(x); the second derivative is φ″(x) = (x² − 1)φ(x). More generally, the n-th derivative is given by φ⁽ⁿ⁾(x) = (−1)ⁿ Hₙ(x) φ(x), where Hₙ is the Hermite polynomial of order n.[11] When σ² = 0, the density function does not exist. However a generalized function that defines a measure on the real line, and can be used to calculate, for example, the expected value, is

f(x; μ, 0) = δ(x − μ),

where δ(x − μ) is the Dirac delta function, which is equal to infinity at x = μ and is zero elsewhere.

Cumulative distribution function


The cumulative distribution function (CDF) describes the probability of a random variable falling in the interval (−∞, x]. The CDF of the standard normal distribution is denoted with the capital Greek letter Φ (phi), and can be computed as an integral of the probability density function:

Φ(x) = (1/√(2π)) ∫ from −∞ to x of e^(−t²/2) dt = (1/2)[1 + erf(x/√2)].

This integral cannot be expressed in terms of elementary functions, so it is simply expressed as a transformation of the error function, erf, a special function. Numerical methods for calculation of the standard normal CDF are discussed below. For a generic normal random variable with mean μ and variance σ² > 0 the CDF will be equal to

F(x; μ, σ²) = Φ((x − μ)/σ).

The complement of the standard normal CDF, Q(x) = 1 − Φ(x), is referred to as the Q-function, especially in engineering texts.[12][13] This represents the upper tail probability of the Gaussian distribution: that is, the probability that a standard normal random variable X is greater than the number x. Other definitions of the Q-function, all of which are simple transformations of Φ, are also used occasionally.[14]
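A minimal sketch of the erf relation using the Python standard library (the function names are illustrative):

    import math

    def std_normal_cdf(x):
        # Phi(x) = (1/2) * (1 + erf(x / sqrt(2)))
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def normal_cdf(x, mu, sigma):
        # F(x; mu, sigma^2) = Phi((x - mu) / sigma)
        return std_normal_cdf((x - mu) / sigma)

    print(std_normal_cdf(1.96))       # ~0.975
    print(normal_cdf(130, 100, 15))   # P(X <= 130) for X ~ N(100, 15^2), ~0.977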

Properties: The standard normal CDF is 2-fold rotationally symmetric around the point (0, 1/2): Φ(−x) = 1 − Φ(x). The derivative of Φ(x) is equal to the standard normal pdf φ(x): Φ′(x) = φ(x). The antiderivative of Φ(x) is: ∫Φ(x) dx = xΦ(x) + φ(x) + C. For a normal distribution with zero variance, the CDF is the Heaviside step function (with the H(0) = 1 convention):

F(x; μ, 0) = H(x − μ).

Quantile function
The quantile function of a distribution is the inverse of the CDF. The quantile function of the standard normal distribution is called the probit function, and can be expressed in terms of the inverse error function:

Φ⁻¹(p) = √2 erf⁻¹(2p − 1),   for p ∈ (0, 1).

Quantiles of the standard normal distribution are commonly denoted as zp. The quantile zp represents such a value that a standard normal random variable X has the probability of exactly p to fall inside the (−∞, zp] interval. The quantiles are used in hypothesis testing, construction of confidence intervals and Q-Q plots. The most "famous" normal quantile is 1.96 = z0.975. A standard normal random variable is greater than 1.96 in absolute value in 5% of cases. For a normal random variable with mean μ and variance σ², the quantile function is

F⁻¹(p; μ, σ²) = μ + σΦ⁻¹(p) = μ + σ√2 erf⁻¹(2p − 1).
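A minimal sketch using the standard library (Python 3.8+ provides statistics.NormalDist; the values chosen are arbitrary):

    from statistics import NormalDist

    # Probit function: the inverse of the standard normal CDF
    print(NormalDist().inv_cdf(0.975))            # ~1.959964, the familiar z_0.975

    # Quantile of a general normal: F^{-1}(p) = mu + sigma * Phi^{-1}(p)
    mu, sigma = 5.0, 2.0
    print(mu + sigma * NormalDist().inv_cdf(0.975))
    print(NormalDist(mu, sigma).inv_cdf(0.975))   # same value, computed directly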

Characteristic function and moment generating function


The characteristic function φX(t) of a random variable X is defined as the expected value of e^(itX), where i is the imaginary unit, and t ∈ R is the argument of the characteristic function. Thus the characteristic function is the Fourier transform of the density f(x). For a normally distributed X with mean μ and variance σ², the characteristic function is[15]

φX(t) = e^(iμt − σ²t²/2).

The characteristic function can be analytically extended to the entire complex plane: one defines φ(z) = e^(iμz − σ²z²/2) for all z ∈ C.[16] The moment generating function is defined as the expected value of e^(tX). For a normal distribution, the moment generating function exists and is equal to

M(t) = e^(μt + σ²t²/2).

The cumulant generating function is the logarithm of the moment generating function:

g(t) = ln M(t) = μt + σ²t²/2.

Since this is a quadratic polynomial in t, only the first two cumulants are nonzero.


Moments
The normal distribution has moments of all orders. That is, for a normally distributed X with mean μ and variance σ², the expectation E[|X|^p] exists and is finite for all p such that Re[p] > −1. Usually we are interested only in moments of integer orders: p = 1, 2, 3, ... Central moments are the moments of X around its mean μ. Thus, a central moment of order p is the expected value of (X − μ)^p. Using standardization of normal random variables, this expectation will be equal to σ^p E[Z^p], where Z is standard normal:

E[(X − μ)^p] = 0 if p is odd,   E[(X − μ)^p] = σ^p (p − 1)!! if p is even.

Here n!! denotes the double factorial, that is, the product of every odd number from n down to 1. Central absolute moments are the moments of |X − μ|. They coincide with regular central moments for all even orders, but are nonzero for all odd p:

E[|X − μ|^p] = σ^p (p − 1)!! √(2/π) if p is odd,   σ^p (p − 1)!! if p is even,

which can also be written in the single form E[|X − μ|^p] = σ^p 2^(p/2) Γ((p + 1)/2)/√π.

The last formula is true for any non-integer p > −1. Raw moments and raw absolute moments are the moments of X and |X| respectively. The formulas for these moments are much more complicated, and are given in terms of confluent hypergeometric functions 1F1 and U.

These expressions remain valid even if p is not integer. See also generalized Hermite polynomials. The first two cumulants are equal to μ and σ² respectively, whereas all higher-order cumulants are equal to zero.
Order   Raw moment                                          Central moment   Cumulant
1       μ                                                   0                μ
2       μ² + σ²                                             σ²               σ²
3       μ³ + 3μσ²                                           0                0
4       μ⁴ + 6μ²σ² + 3σ⁴                                    3σ⁴              0
5       μ⁵ + 10μ³σ² + 15μσ⁴                                 0                0
6       μ⁶ + 15μ⁴σ² + 45μ²σ⁴ + 15σ⁶                         15σ⁶             0
7       μ⁷ + 21μ⁵σ² + 105μ³σ⁴ + 105μσ⁶                      0                0
8       μ⁸ + 28μ⁶σ² + 210μ⁴σ⁴ + 420μ²σ⁶ + 105σ⁸             105σ⁸            0


Properties
Standardizing normal random variables
Because the normal distribution is a location-scale family, it is possible to relate all normal random variables to the standard normal. For example if X is normal with mean μ and variance σ², then

Z = (X − μ)/σ

has mean zero and unit variance, that is Z has the standard normal distribution. Conversely, having a standard normal random variable Z we can always construct another normal random variable with specific mean μ and variance σ²:

X = σZ + μ.

This "standardizing" transformation is convenient as it allows one to compute the PDF and especially the CDF of a normal distribution having the table of PDF and CDF values for the standard normal. They will be related via

F_X(x) = Φ((x − μ)/σ),   f_X(x) = (1/σ) φ((x − μ)/σ).

Standard deviation and tolerance intervals


About 68% of values drawn from a normal distribution are within one standard deviation away from the mean; about 95% of the values lie within two standard deviations; and about 99.7% are within three standard deviations. This fact is known as the 68-95-99.7 rule, or the empirical rule, or the 3-sigma rule. To be more precise, the area under the bell curve between μ − nσ and μ + nσ is given by

F(μ + nσ) − F(μ − nσ) = Φ(n) − Φ(−n) = erf(n/√2),

where erf is the error function. To 12 decimal places, the values for the 1-, 2-, up to 6-sigma points are:[17]

Dark blue is less than one standard deviation from the mean. For the normal distribution, this accounts for about 68% of the set, while two standard deviations from the mean (medium and dark blue) account for about 95%, and three standard deviations (light, medium, and dark blue) account for about 99.7%.

n    F(μ+nσ) − F(μ−nσ)     i.e. 1 minus ...       or 1 in ...
1    0.682689492137        0.317310507863         3.15148718753
2    0.954499736104        0.045500263896         21.9778945080
3    0.997300203937        0.002699796063         370.398347345
4    0.999936657516        0.000063342484         15787.1927673
5    0.999999426697        0.000000573303         1744277.89362
6    0.999999998027        0.000000001973         506797345.897

The next table gives the reverse relation of sigma multiples corresponding to a few often used values for the area under the bell curve. These values are useful to determine (asymptotic) tolerance intervals of the specified levels based on normally distributed (or asymptotically normal) estimators:[18]


proportion    n                     proportion     n
0.80          1.281551565545        0.999          3.290526731492
0.90          1.644853626951        0.9999         3.890591886413
0.95          1.959963984540        0.99999        4.417173413469
0.98          2.326347874041        0.999999       4.891638475699
0.99          2.575829303549        0.9999999      5.326723886384
0.995         2.807033768344        0.99999999     5.730728868236
0.998         3.090232306168        0.999999999    6.109410204869

where the value on the left of the table is the proportion of values that will fall within a given interval and n is a multiple of the standard deviation that specifies the width of the interval.
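Both tables can be regenerated from the error function and its inverse; a minimal Python sketch:

    import math
    from statistics import NormalDist

    # Probability of falling within n standard deviations of the mean
    for n in range(1, 7):
        p = math.erf(n / math.sqrt(2.0))
        print(n, p, 1 - p)

    # Reverse direction: sigma multiple that covers a given central proportion
    for p in (0.80, 0.95, 0.99, 0.999):
        print(p, NormalDist().inv_cdf((1 + p) / 2))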

Central limit theorem


The theorem states that under certain (fairly common) conditions, the sum of a large number of random variables will have an approximately normal distribution. For example if (x1, ..., xn) is a sequence of iid random variables, each having mean μ and variance σ², then the central limit theorem states that

√n (x̄n − μ) → N(0, σ²) in distribution as n → ∞, where x̄n = (x1 + ... + xn)/n.

The theorem will hold even if the summands xi are not iid, although some constraints on the degree of dependence and the growth rate of moments still have to be imposed.
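A minimal simulation sketch of this behaviour (NumPy is assumed; the choice of Uniform(0,1) summands is arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 50, 100_000

    # Sample means of n iid Uniform(0,1) draws; by the CLT they are approximately
    # normal with mean 1/2 and standard deviation sqrt(1/12) / sqrt(n).
    means = rng.uniform(size=(reps, n)).mean(axis=1)
    print(means.mean(), means.std())        # ~0.5 and ~0.0408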

As the number of discrete events increases, the function begins to resemble a normal distribution

The importance of the central limit theorem cannot be overemphasized. A great number of test statistics, scores, and estimators encountered in practice contain sums of certain random variables in them, even more estimators can be represented as sums of random variables through the use of influence functions; all of these quantities are governed by the central limit theorem and will have asymptotically normal distribution as a result. Another practical consequence of the central limit theorem is that certain other distributions can be approximated by the normal distribution, for example: The binomial distribution B(n, p) is approximately normal N(np, np(1 − p)) for large n and for p not too close to zero or one. The Poisson(λ) distribution is approximately normal N(λ, λ) for large values of λ.

The chi-squared distribution χ²(k) is approximately normal N(k, 2k) for large k. The Student's t-distribution t(ν) is approximately normal N(0, 1) when ν is large. Whether these approximations are sufficiently accurate depends on the purpose for which they are needed, and the rate of convergence to the normal distribution. It is typically the case that such approximations are less accurate in the tails of the distribution. A general upper bound for the approximation error in the central limit theorem is given by the Berry-Esseen theorem, improvements of the approximation are given by the Edgeworth expansions.


Comparison of probability density functions, p(k) for the sum of n fair 6-sided dice to show their convergence to a normal distribution with increasing n, in accordance to the central limit theorem. In the bottom-right graph, smoothed profiles of the previous graphs are rescaled, superimposed and compared with a normal distribution (black curve).

Miscellaneous
1. The family of normal distributions is closed under linear transformations. That is, if X is normally distributed with mean μ and variance σ², then a linear transform aX + b (for some real numbers a and b) is also normally distributed:

aX + b ~ N(aμ + b, a²σ²).

Also if X1, X2 are two independent normal random variables, with means μ1, μ2 and standard deviations σ1, σ2, then their linear combination will also be normally distributed:

aX1 + bX2 ~ N(aμ1 + bμ2, a²σ1² + b²σ2²).[proof]

2. The converse of (1) is also true: if X1 and X2 are independent and their sum X1 + X2 is distributed normally, then both X1 and X2 must also be normal.[19] This is known as Cramér's decomposition theorem. The interpretation of this property is that a normal distribution is only divisible by other normal distributions. Another application of this property is in connection with the central limit theorem: although the CLT asserts that the distribution of a sum of arbitrary non-normal iid random variables is approximately normal, Cramér's theorem shows that it can never become exactly normal.[20]
3. If the characteristic function φX of some random variable X is of the form φX(t) = e^(Q(t)), where Q(t) is a polynomial, then the Marcinkiewicz theorem (named after Józef Marcinkiewicz) asserts that Q can be at most a quadratic polynomial, and therefore X is a normal random variable.[20] The consequence of this result is that the normal distribution is the only distribution with a finite number (two) of non-zero cumulants.
4. If X and Y are jointly normal and uncorrelated, then they are independent. The requirement that X and Y should be jointly normal is essential; without it the property does not hold.[proof] For non-normal random variables uncorrelatedness does not imply independence.

5. If X and Y are independent N(μ, σ²) random variables, then X + Y and X − Y are also independent and identically distributed (this follows from the polarization identity).[21] This property uniquely characterizes the normal distribution, as can be seen from Bernstein's theorem: if X and Y are independent and such that X + Y and X − Y are also independent, then both X and Y must necessarily have normal distributions. More generally, if X1, ..., Xn are independent random variables, then two linear combinations Σak Xk and Σbk Xk will be independent if and only if all Xk's are normal and Σak bk σk² = 0, where σk² denotes the variance of Xk.[22]
6. The normal distribution is infinitely divisible:[23] for a normally distributed X with mean μ and variance σ² we can find n independent random variables {X1, ..., Xn}, each distributed normally with mean μ/n and variance σ²/n, such that X1 + X2 + ... + Xn has the same distribution as X.
7. The normal distribution is stable (with exponent α = 2): if X1, X2 are two independent N(μ, σ²) random variables and a, b are arbitrary real numbers, then aX1 + bX2 has the same distribution as cX3 + d, with c = √(a² + b²) and d = μ(a + b − c), where X3 is also N(μ, σ²). This relationship directly follows from property (1).


8. The Kullback-Leibler divergence between two normal distributions X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²) is given by (see the numerical sketch after this list of properties):[24]

D_KL(X1 ‖ X2) = ln(σ2/σ1) + (σ1² + (μ1 − μ2)²)/(2σ2²) − 1/2.

The Hellinger distance between the same distributions is equal to

H²(X1, X2) = 1 − √(2σ1σ2/(σ1² + σ2²)) e^(−(μ1 − μ2)²/(4(σ1² + σ2²))).

9. The Fisher information matrix for a normal distribution is diagonal and takes the form

I(μ, σ²) = diag(1/σ², 1/(2σ⁴)).

10. Normal distributions belong to an exponential family with natural parameters θ1 = μ/σ² and θ2 = −1/(2σ²), and natural statistics x and x². The dual, expectation parameters for the normal distribution are η1 = μ and η2 = μ² + σ².
11. The conjugate prior of the mean of a normal distribution is another normal distribution.[25] Specifically, if x1, ..., xn are iid N(μ, σ²) and the prior is μ ~ N(μ0, σ0²), then the posterior distribution for the estimator of μ will be

μ | x1, ..., xn ~ N( (σ²μ0 + σ0² Σxi)/(σ² + nσ0²), σ²σ0²/(σ² + nσ0²) ).

12. Of all probability distributions over the reals with mean μ and variance σ², the normal distribution N(μ, σ²) is the one with the maximum entropy.[26]
13. The family of normal distributions forms a manifold with constant curvature −1. The same family is flat with respect to the (±1)-connections ∇(e) and ∇(m).[27]
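A minimal numerical sketch of the Kullback-Leibler formula from property 8 (the parameter values are arbitrary):

    import math

    def kl_normal(mu1, s1, mu2, s2):
        # KL( N(mu1, s1^2) || N(mu2, s2^2) )
        return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

    print(kl_normal(0.0, 1.0, 0.0, 1.0))   # 0.0: identical distributions
    print(kl_normal(1.0, 1.0, 0.0, 2.0))   # ~0.443: positive, and asymmetric in its arguments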


Related distributions
Operations on a single random variable
If X is distributed normally with mean μ and variance σ², then:
The exponential of X is distributed log-normally: e^X ~ ln N(μ, σ²).
The absolute value of X has a folded normal distribution: |X| ~ Nf(μ, σ²). If μ = 0 this is known as the half-normal distribution.
The square of X/σ has the noncentral chi-squared distribution with one degree of freedom: X²/σ² ~ χ²1(μ²/σ²). If μ = 0, the distribution is called simply chi-squared.
The distribution of the variable X restricted to an interval [a, b] is called the truncated normal distribution.
(X − μ)^(−2) has a Lévy distribution with location 0 and scale σ^(−2).

Combination of two independent random variables


If X1 and X2 are two independent standard normal random variables with mean 0 and variance 1, then:
Their sum and difference are distributed normally with mean zero and variance two: X1 ± X2 ∼ N(0, 2).
Their product Z = X1·X2 follows the "product-normal" distribution[28] with density function fZ(z) = π⁻¹ K0(|z|), where K0 is the modified Bessel function of the second kind. This distribution is symmetric around zero, unbounded at z = 0, and has the characteristic function φZ(t) = (1 + t²)^(−1/2).
Their ratio follows the standard Cauchy distribution: X1/X2 ∼ Cauchy(0, 1).
Their Euclidean norm √(X1² + X2²) has the Rayleigh distribution, also known as the chi distribution with 2 degrees of freedom.

Combination of two or more independent random variables


If X1, X2, …, Xn are independent standard normal random variables, then the sum of their squares has the chi-squared distribution with n degrees of freedom: X1² + ⋯ + Xn² ∼ χ²n.
If X1, X2, …, Xn are independent normally distributed random variables with means μ and variances σ², then their sample mean is independent from the sample standard deviation, which can be demonstrated using Basu's theorem or Cochran's theorem. The ratio of these two quantities will have the Student's t-distribution with n − 1 degrees of freedom:

t = (X̄ − μ) / (S/√n) ∼ t_{n−1}.

If X1, …, Xn, Y1, …, Ym are independent standard normal random variables, then the ratio of their normalized sums of squares will have the F-distribution with (n, m) degrees of freedom:

F = ( (X1² + ⋯ + Xn²)/n ) / ( (Y1² + ⋯ + Ym²)/m ) ∼ F(n, m).


Operations on the density function


The split normal distribution is most directly defined in terms of joining scaled sections of the density functions of different normal distributions and rescaling the density to integrate to one. The truncated normal distribution results from rescaling a section of a single density function.

Extensions
The notion of normal distribution, being one of the most important distributions in probability theory, has been extended far beyond the standard framework of the univariate (that is, one-dimensional) case. All these extensions are also called normal or Gaussian laws, so a certain ambiguity in names exists.
The multivariate normal distribution describes the Gaussian law in the k-dimensional Euclidean space. A vector X ∈ R^k is multivariate-normally distributed if any linear combination of its components ∑ aj Xj has a (univariate) normal distribution. The variance of X is a k×k symmetric positive-definite matrix V.
Rectified Gaussian distribution: a rectified version of the normal distribution with all the negative elements reset to 0.
The complex normal distribution deals with complex normal vectors. A complex vector X ∈ C^k is said to be normal if both its real and imaginary components jointly possess a 2k-dimensional multivariate normal distribution. The variance-covariance structure of X is described by two matrices: the variance matrix and the relation matrix C.
The matrix normal distribution describes the case of normally distributed matrices.
Gaussian processes are the normally distributed stochastic processes. These can be viewed as elements of some infinite-dimensional Hilbert space H, and thus are the analogues of multivariate normal vectors for the case k = ∞. A random element h ∈ H is said to be normal if for any constant a ∈ H the scalar product (a, h) has a (univariate) normal distribution. The variance structure of such a Gaussian random element can be described in terms of the linear covariance operator K: H → H. Several Gaussian processes became popular enough to have their own names: Brownian motion, Brownian bridge, Ornstein–Uhlenbeck process.
The Gaussian q-distribution is an abstract mathematical construction which represents a "q-analogue" of the normal distribution.
The q-Gaussian is an analogue of the Gaussian distribution, in the sense that it maximises the Tsallis entropy, and is one type of Tsallis distribution. Note that this distribution is different from the Gaussian q-distribution above.
One of the main practical uses of the Gaussian law is to model the empirical distributions of many different random variables encountered in practice. In such a case a possible extension would be a richer family of distributions, having more than two parameters and therefore being able to fit the empirical distribution more accurately. Examples of such extensions are:
The Pearson distribution: a four-parameter family of probability distributions that extend the normal law to include different skewness and kurtosis values.


Normality tests
Normality tests assess the likelihood that the given data set {x1, …, xn} comes from a normal distribution. Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean μ and variance σ², versus the alternative Ha that the distribution is arbitrary. A great number of tests (over 40) have been devised for this problem; the more prominent of them are outlined below.
"Visual" tests are more intuitively appealing but subjective at the same time, as they rely on informal human judgement to accept or reject the null hypothesis.
Q-Q plot: a plot of the sorted values from the data set against the expected values of the corresponding quantiles from the standard normal distribution. That is, it is a plot of points of the form (Φ⁻¹(pk), x(k)), where the plotting points pk are equal to pk = (k − α)/(n + 1 − 2α) and α is an adjustment constant which can be anything between 0 and 1. If the null hypothesis is true, the plotted points should approximately lie on a straight line.
P-P plot: similar to the Q-Q plot, but used much less frequently. This method consists of plotting the points (Φ(z(k)), pk), where z(k) = (x(k) − μ̂)/σ̂. For normally distributed data this plot should lie on a 45° line between (0, 0) and (1, 1).
Shapiro–Wilk test: employs the fact that the line in the Q-Q plot has slope σ. The test compares the least squares estimate of that slope with the value of the sample variance, and rejects the null hypothesis if these two quantities differ significantly.
Normal probability plot (rankit plot).
Moment tests: D'Agostino's K-squared test; Jarque–Bera test.
Empirical distribution function tests: Lilliefors test (an adaptation of the Kolmogorov–Smirnov test); Anderson–Darling test.
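A minimal sketch of two of the tests listed above, assuming NumPy and SciPy are available (the sample, seed and parameter values are illustrative only, not part of the original text):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # sample to be tested

# Shapiro-Wilk: small p-values argue against the null hypothesis of normality.
w_stat, p_value = stats.shapiro(x)
print(f"Shapiro-Wilk W = {w_stat:.4f}, p = {p_value:.3f}")

# Q-Q plot data: theoretical standard-normal quantiles vs. ordered sample values.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")
print(f"Q-Q line: slope (approx. sigma) = {slope:.3f}, intercept (approx. mu) = {intercept:.3f}")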

Estimation of parameters
It is often the case that we don't know the parameters of the normal distribution, but instead want to estimate them. That is, having a sample (x1, …, xn) from a normal N(μ, σ²) population we would like to learn the approximate values of the parameters μ and σ². The standard approach to this problem is the maximum likelihood method, which requires maximization of the log-likelihood function:

ln L(μ, σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑ᵢ (xi − μ)².

Taking derivatives with respect to μ and σ² and solving the resulting system of first order conditions yields the maximum likelihood estimates:

μ̂ = x̄ = (1/n) ∑ᵢ xi,   σ̂² = (1/n) ∑ᵢ (xi − x̄)².

Estimator μ̂ is called the sample mean, since it is the arithmetic mean of all observations. The statistic x̄ is complete and sufficient for μ, and therefore by the Lehmann–Scheffé theorem, μ̂ is the uniformly minimum variance unbiased (UMVU) estimator.[29] In finite samples it is distributed normally:

μ̂ ∼ N(μ, σ²/n).

The variance of this estimator is equal to the μμ-element of the inverse Fisher information matrix. This implies that the estimator is finite-sample efficient. Of practical importance is the fact that the standard error of μ̂ is proportional to 1/√n, that is, if one wishes to decrease the standard error by a factor of 10, one must increase the number of points in the sample by a factor of 100. This fact is widely used in determining sample sizes for opinion polls and the number of trials in Monte Carlo simulations. From the standpoint of the asymptotic theory, μ̂ is consistent, that is, it converges in probability to μ as n → ∞. The estimator is also asymptotically normal, which is a simple corollary of the fact that it is normal in finite samples:

√n (μ̂ − μ) → N(0, σ²) in distribution.


The estimator σ̂² is called the sample variance, since it is the variance of the sample (x1, …, xn). In practice, another estimator is often used instead of σ̂². This other estimator is denoted s², and is also called the sample variance, which represents a certain ambiguity in terminology; its square root s is called the sample standard deviation. The estimator s² differs from σ̂² by having (n − 1) instead of n in the denominator (the so-called Bessel's correction):

s² = (1/(n − 1)) ∑ᵢ (xi − x̄)² = (n/(n − 1)) σ̂².

The difference between s² and σ̂² becomes negligibly small for large n's. In finite samples however, the motivation behind the use of s² is that it is an unbiased estimator of the underlying parameter σ², whereas σ̂² is biased. Also, by the Lehmann–Scheffé theorem the estimator s² is uniformly minimum variance unbiased (UMVU),[29] which makes it the "best" estimator among all unbiased ones. However it can be shown that the biased estimator σ̂² is "better" than s² in terms of the mean squared error (MSE) criterion. In finite samples both s² and σ̂² have scaled chi-squared distribution with (n − 1) degrees of freedom:

s² ∼ (σ²/(n − 1)) · χ²_{n−1},   σ̂² ∼ (σ²/n) · χ²_{n−1}.

The first of these expressions shows that the variance of s² is equal to 2σ⁴/(n − 1), which is slightly greater than the σσ-element of the inverse Fisher information matrix, 2σ⁴/n. Thus, s² is not an efficient estimator for σ², and moreover, since s² is UMVU, we can conclude that the finite-sample efficient estimator for σ² does not exist. Applying the asymptotic theory, both estimators s² and σ̂² are consistent, that is they converge in probability to σ² as the sample size n → ∞. The two estimators are also both asymptotically normal:

√n (s² − σ²) → N(0, 2σ⁴) in distribution,

and in particular, both estimators are asymptotically efficient for σ². By Cochran's theorem, for the normal distribution the sample mean μ̂ and the sample variance s² are independent, which means there can be no gain in considering their joint distribution. There is also a reverse theorem: if in a sample the sample mean and sample variance are independent, then the sample must have come from the normal distribution. The independence between μ̂ and s can be employed to construct the so-called t-statistic:

t = (μ̂ − μ) / (s/√n).

This quantity t has the Student's t-distribution with (n − 1) degrees of freedom, and it is an ancillary statistic (independent of the value of the parameters). Inverting the distribution of this t-statistic will allow us to construct the confidence interval for μ;[30] similarly, inverting the χ² distribution of the statistic s² will give us the confidence interval for σ²:[31]

μ ∈ [ μ̂ ± t_{n−1, 1−α/2} · s/√n ] ≈ [ μ̂ ± |z_{α/2}| · s/√n ],
σ² ∈ [ (n − 1)s²/χ²_{n−1, 1−α/2},  (n − 1)s²/χ²_{n−1, α/2} ] ≈ [ s² ± |z_{α/2}| · √(2/n) · s² ],

where t_{k,p} and χ²_{k,p} are the p-th quantiles of the t- and χ²-distributions respectively. These confidence intervals are of the confidence level 1 − α, meaning that the true values μ and σ² fall outside of these intervals with probability α. In practice people usually take α = 5%, resulting in 95% confidence intervals. The approximate formulas in the display above were derived from


the asymptotic distributions of μ̂ and s². The approximate formulas become valid for large values of n, and are more convenient for manual calculation since the standard normal quantiles z_{α/2} do not depend on n. In particular, the most popular value of α = 5% results in |z_{0.025}| = 1.96.
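A minimal sketch of these estimators and the exact confidence intervals, assuming NumPy and SciPy (the sample, seed and α value are illustrative assumptions, not part of the original text):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=3.0, size=50)
n = x.size

mu_hat = x.mean()            # maximum likelihood estimate of mu
sigma2_hat = x.var(ddof=0)   # biased MLE of sigma^2
s2 = x.var(ddof=1)           # unbiased sample variance (Bessel's correction)

alpha = 0.05
t_q = stats.t.ppf(1 - alpha / 2, df=n - 1)
ci_mu = (mu_hat - t_q * np.sqrt(s2 / n), mu_hat + t_q * np.sqrt(s2 / n))

chi2_lo = stats.chi2.ppf(alpha / 2, df=n - 1)
chi2_hi = stats.chi2.ppf(1 - alpha / 2, df=n - 1)
ci_sigma2 = ((n - 1) * s2 / chi2_hi, (n - 1) * s2 / chi2_lo)

print(ci_mu, ci_sigma2)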

Bayesian analysis of the normal distribution


Bayesian analysis of normally-distributed data is complicated by the many different possibilities that may be considered: Either the mean, or the variance, or neither, may be considered a fixed quantity. When the variance is unknown, analysis may be done directly in terms of the variance, or in terms of the precision, the reciprocal of the variance. The reason for expressing the formulas in terms of precision is that the analysis of most cases is simplified. Both univariate and multivariate cases need to be considered. Either conjugate or improper prior distributions may be placed on the unknown variables. An additional set of cases occurs in Bayesian linear regression, where in the basic model the data is assumed to be normally-distributed, and normal priors are placed on the regression coefficients. The resulting analysis is similar to the basic cases of independent identically distributed data, but more complex. The formulas for the non-linear-regression cases are summarized in the conjugate prior article.

The sum of two quadratics


Scalar form
The following auxiliary formula is useful for simplifying the posterior update equations, which otherwise become fairly tedious.

a(x − y)² + b(x − z)² = (a + b)( x − (ay + bz)/(a + b) )² + (ab/(a + b))(y − z)².

This equation rewrites the sum of two quadratics in x by expanding the squares, grouping the terms in x, and completing the square. Note the following about the constant factors attached to some of the terms:
1. The factor (ay + bz)/(a + b) has the form of a weighted average of y and z.
2. The factor ab/(a + b) = 1/(1/a + 1/b) shows that this factor can be thought of as resulting from a situation where the reciprocals of quantities a and b add directly, so to combine a and b themselves, it's necessary to reciprocate, add, and reciprocate the result again to get back into the original units. This is exactly the sort of operation performed by the harmonic mean, so it is not surprising that ab/(a + b) is one-half the harmonic mean of a and b.

Vector form
A similar formula can be written for the sum of two vector quadratics: if x, y, z are vectors of length k, and A and B are symmetric, invertible matrices of size k×k, then

(y − x)ᵀ A (y − x) + (x − z)ᵀ B (x − z) = (x − c)ᵀ (A + B) (x − c) + (y − z)ᵀ (A⁻¹ + B⁻¹)⁻¹ (y − z),

where

c = (A + B)⁻¹ (Ay + Bz).

Note that the form

xᵀ A x

is called a quadratic form and is a scalar:

xᵀ A x = ∑_{i,j} a_{ij} x_i x_j.

In other words, it sums up all possible combinations of products of pairs of elements from x, with a separate coefficient for each. In addition, since x_i x_j = x_j x_i, only the sum a_{ij} + a_{ji} matters for any off-diagonal elements of A, and there is no loss of generality in assuming that A is symmetric. Furthermore, if A is symmetric, then the form xᵀ A y = yᵀ A x.

The sum of differences from the mean


Another useful formula is as follows:

∑ᵢ (xi − μ)² = ∑ᵢ (xi − x̄)² + n(x̄ − μ)²,

where x̄ = (1/n) ∑ᵢ xi.
With known variance


For a set of i.i.d. normally-distributed data points X of size n where each individual point x follows x ∼ N(μ, σ²) with known variance σ², the conjugate prior distribution is also normally distributed. This can be shown more easily by rewriting the variance as the precision, i.e. using τ = 1/σ². Then if x ∼ N(μ, 1/τ) and the prior is μ ∼ N(μ0, 1/τ0), the likelihood function is (using the formula above for the sum of differences from the mean)

p(X | μ, τ) = ∏ᵢ √(τ/(2π)) exp( −(τ/2)(xi − μ)² ) = (τ/(2π))^(n/2) exp( −(τ/2)( ∑ᵢ (xi − x̄)² + n(x̄ − μ)² ) ).

Then, we proceed as follows:


In the above derivation, we used the formula above for the sum of two quadratics and eliminated all constant factors not involving μ. The result is the kernel of a normal distribution, with mean (nτx̄ + τ0μ0)/(nτ + τ0) and precision nτ + τ0.

This can be written as a set of Bayesian update equations for the posterior parameters in terms of the prior parameters:

τ0′ = τ0 + nτ,
μ0′ = (nτ x̄ + τ0 μ0) / (nτ + τ0),
x̄ = (1/n) ∑ᵢ xi.

That is, to combine n data points with total precision of nτ (or equivalently, total variance of σ²/n) and mean of values x̄, derive a new total precision simply by adding the total precision of the data to the prior total precision, and form a new mean through a precision-weighted average, i.e. a weighted average of the data mean and the prior mean, each weighted by the associated total precision. This makes logical sense if the precision is thought of as indicating the certainty of the observations: In the distribution of the posterior mean, each of the input components is weighted by its certainty, and the certainty of this distribution is the sum of the individual certainties. (For the intuition of this, compare the expression "the whole is (or is not) greater than the sum of its parts". In addition, consider that the knowledge of the posterior comes from a combination of the knowledge of the prior and likelihood, so it makes sense that we are more certain of it than of either of its components.) The above formula reveals why it is more convenient to do Bayesian analysis of conjugate priors for the normal distribution in terms of the precision. The posterior precision is simply the sum of the prior and likelihood precisions, and the posterior mean is computed through a precision-weighted average, as described above. The same formulas can be written in terms of variance by reciprocating all the precisions, yielding the uglier formulas

σ0²′ = 1/( n/σ² + 1/σ0² ),
μ0′ = ( n x̄/σ² + μ0/σ0² ) / ( n/σ² + 1/σ0² ),
x̄ = (1/n) ∑ᵢ xi.
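A minimal sketch of this update in the precision parameterization, assuming NumPy (the function and variable names are illustrative choices for this sketch):

import numpy as np

def update_known_variance(prior_mean, prior_precision, data, data_precision):
    """Return posterior mean and precision for mu, given N(mu, 1/data_precision) data."""
    n = len(data)
    xbar = np.mean(data)
    post_precision = prior_precision + n * data_precision
    post_mean = (prior_precision * prior_mean + n * data_precision * xbar) / post_precision
    return post_mean, post_precision

rng = np.random.default_rng(2)
data = rng.normal(loc=4.0, scale=1.0, size=25)      # known sigma = 1, so tau = 1
print(update_known_variance(prior_mean=0.0, prior_precision=0.1,
                            data=data, data_precision=1.0))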


With known mean


For a set of i.i.d. normally-distributed data points X of size n where each individual point x follows x ∼ N(μ, σ²) with known mean μ, the conjugate prior of the variance has an inverse gamma distribution or a scaled inverse chi-squared distribution. The two are equivalent except for having different parameterizations. The use of the inverse gamma is more common, but the scaled inverse chi-squared is more convenient, so we use it in the following derivation. The prior for σ² is as follows:

The likelihood function from above, written in terms of the variance, is:

where S = ∑ᵢ (xi − μ)². Then:

This is also a scaled inverse chi-squared distribution, where

or equivalently

Reparameterizing in terms of an inverse gamma distribution, the result is:
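Whichever parameterization is used, the posterior can be checked numerically. A minimal sketch in the inverse-gamma parameterization, assuming NumPy; the hyperparameter names alpha0/beta0 and the standard conjugate update (alpha0 + n/2, beta0 + S/2) are conventions assumed by this sketch:

import numpy as np

def update_known_mean(alpha0, beta0, data, mu):
    """Posterior Inverse-Gamma(alpha, beta) for sigma^2, given N(mu, sigma^2) data with known mu."""
    x = np.asarray(data)
    n = x.size
    ss = np.sum((x - mu) ** 2)   # sum of squared deviations from the known mean
    return alpha0 + n / 2.0, beta0 + ss / 2.0

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.5, size=40)
print(update_known_mean(alpha0=2.0, beta0=2.0, data=data, mu=2.0))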


With unknown mean and variance


For a set of i.i.d. normally-distributed data points X of size n where each individual point x follows x ∼ N(μ, σ²) with unknown mean μ and variance σ², a combined (multivariate) conjugate prior is placed over the mean and variance, consisting of a normal-inverse-gamma distribution. Logically, this originates as follows:
1. From the analysis of the case with unknown mean but known variance, we see that the update equations involve sufficient statistics computed from the data consisting of the mean of the data points and the total variance of the data points, computed in turn from the known variance divided by the number of data points.
2. From the analysis of the case with unknown variance but known mean, we see that the update equations involve sufficient statistics over the data consisting of the number of data points and the sum of squared deviations.
3. Keep in mind that the posterior update values serve as the prior distribution when further data is handled. Thus, we should logically think of our priors in terms of the sufficient statistics just described, with the same semantics kept in mind as much as possible.
4. To handle the case where both mean and variance are unknown, we could place independent priors over the mean and variance, with fixed estimates of the average mean, total variance, number of data points used to compute the variance prior, and sum of squared deviations. Note however that in reality, the total variance of the mean depends on the unknown variance, and the sum of squared deviations that goes into the variance prior (appears to) depend on the unknown mean. In practice, the latter dependence is relatively unimportant: shifting the actual mean shifts the generated points by an equal amount, and on average the squared deviations will remain the same. This is not the case, however, with the total variance of the mean: as the unknown variance increases, the total variance of the mean will increase proportionately, and we would like to capture this dependence.
5. This suggests that we create a conditional prior of the mean on the unknown variance, with a hyperparameter specifying the mean of the pseudo-observations associated with the prior, and another parameter specifying the number of pseudo-observations. This number serves as a scaling parameter on the variance, making it possible to control the overall variance of the mean relative to the actual variance parameter. The prior for the variance also has two hyperparameters, one specifying the sum of squared deviations of the pseudo-observations associated with the prior, and another specifying once again the number of pseudo-observations. Note that each of the priors has a hyperparameter specifying the number of pseudo-observations, and in each case this controls the relative variance of that prior. These are given as two separate hyperparameters so that the variance (aka the confidence) of the two priors can be controlled separately.
6. This leads immediately to the normal-inverse-gamma distribution, which is defined as the product of the two distributions just defined, with conjugate priors used (an inverse gamma distribution over the variance, and a normal distribution over the mean, conditional on the variance) and with the same four parameters just defined.
The priors are normally defined as follows:

The update equations can be derived, and look as follows:

The respective numbers of pseudo-observations just add the number of actual observations to them. The new mean hyperparameter is once again a weighted average, this time weighted by the relative numbers of observations. Finally, the update for the sum-of-squared-deviations hyperparameter is similar to the case with known mean, but in this case the sum of squared deviations is taken with respect to the observed data mean rather than the true mean, and as a result a new "interaction term" needs to be added to take care of the additional error source stemming from the deviation between prior and data mean. The proof is as follows.
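A minimal sketch of this combined update, assuming NumPy; the hyperparameter names (mu0, kappa0, alpha0, beta0) and the normal-inverse-gamma update rule follow the usual conjugate-prior convention and are an assumption of this sketch rather than a quotation of the text:

import numpy as np

def update_unknown_mean_variance(mu0, kappa0, alpha0, beta0, data):
    x = np.asarray(data)
    n = x.size
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n            # precision-weighted average of means
    alpha_n = alpha0 + n / 2.0
    # the "interaction term" accounts for the gap between the prior mean and the data mean
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

rng = np.random.default_rng(4)
print(update_unknown_mean_variance(0.0, 1.0, 1.0, 1.0, rng.normal(3.0, 2.0, size=30)))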


Occurrence
The occurrence of normal distribution in practical problems can be loosely classified into three categories: 1. Exactly normal distributions; 2. Approximately normal laws, for example when such approximation is justified by the central limit theorem; and 3. Distributions modeled as normal the normal distribution being the distribution with maximum entropy for a given mean and variance.

Exact normality
Certain quantities in physics are distributed normally, as was first demonstrated by James Clerk Maxwell. Examples of such quantities are:
Velocities of the molecules in the ideal gas. More generally, velocities of the particles in any system in thermodynamic equilibrium will have normal distribution, due to the maximum entropy principle.
Probability density function of a ground state in a quantum harmonic oscillator. (The ground state of a quantum harmonic oscillator has the Gaussian distribution.)
The position of a particle which experiences diffusion. If initially the particle is located at a specific point (that is, its probability distribution is the Dirac delta function), then after time t its location is described by a normal distribution with variance t, which satisfies the diffusion equation ∂f(x,t)/∂t = ½ ∂²f(x,t)/∂x². If the initial location is given by a certain density function g(x), then the density at time t is the convolution of g and the normal PDF.

Approximate normality
Approximately normal distributions occur in many situations, as explained by the central limit theorem. When the outcome is produced by a large number of small effects acting additively and independently, its distribution will be close to normal. The normal approximation will not be valid if the effects act multiplicatively (instead of additively), or if there is a single external influence which has a considerably larger magnitude than the rest of the effects. In counting problems, where the central limit theorem includes a discrete-to-continuum approximation and where infinitely divisible and decomposable distributions are involved, such as Binomial random variables, associated with binary response variables; Poisson random variables, associated with rare events; Thermal light has a BoseEinstein distribution on very short time scales, and a normal distribution on longer timescales due to the central limit theorem.


Assumed normality
I can only recognize the occurrence of the normal curve, the Laplacian curve of errors, as a very abnormal phenomenon. It is roughly approximated to in certain distributions; for this reason, and on account of its beautiful simplicity, we may, perhaps, use it as a first approximation, particularly in theoretical investigations. Pearson (1901)
There are statistical methods to empirically test that assumption; see the above Normality tests section.
In biology, the logarithm of various variables tend to have a normal distribution, that is, they tend to have a log-normal distribution (after separation on male/female subpopulations), with examples including: measures of size of living tissue (length, height, skin area, weight);[32] the length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth (presumably the thickness of tree bark also falls under this category); and certain physiological measurements, such as blood pressure of adult humans.
In finance, in particular the Black–Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal (these variables behave like compound interest, not like simple interest, and so are multiplicative). Some mathematicians such as Benoît Mandelbrot have argued that log-Lévy distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of stock market crashes.
Measurement errors in physical experiments are often modeled by a normal distribution. This use of a normal distribution does not imply that one is assuming the measurement errors are normally distributed; rather, using the normal distribution produces the most conservative predictions possible given only knowledge about the mean and variance of the errors.[33]
In standardized testing, results can be made to have a normal distribution. This is done by either selecting the number and difficulty of questions (as in the IQ test), or by transforming the raw test scores into "output" scores by fitting them to the normal distribution. For example, the SAT's traditional range of 200–800 is based on a normal distribution with a mean of 500 and a standard deviation of 100. Many scores are derived from the normal distribution, including percentile ranks ("percentiles" or "quantiles"), normal curve equivalents, stanines, z-scores, and T-scores. Additionally, a number of behavioral statistical procedures are based on the assumption that scores are normally distributed; for example, t-tests and ANOVAs. Bell curve grading assigns relative grades based on a normal distribution of scores.
Fitted cumulative normal distribution to October rainfalls
In hydrology the distribution of long duration river discharge or rainfall, e.g. monthly and yearly totals, is often thought to be practically normal according to the central limit theorem.[34] The blue picture illustrates an example of fitting the normal distribution to ranked October rainfalls showing the 90% confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part of the cumulative frequency analysis.


Generating values from normal distribution


In computer simulations, especially in applications of the Monte-Carlo method, it is often desirable to generate values that are normally distributed. The algorithms listed below all generate the standard normal deviates, since a N(μ, σ²) variate can be generated as X = μ + σZ, where Z is standard normal. All these algorithms rely on the availability of a random number generator U capable of producing uniform random variates.
The bean machine, a device invented by Francis Galton, can be called the first generator of normal random variables. This machine consists of a vertical board with interleaved rows of pins. Small balls are dropped from the top and then bounce randomly left or right as they hit the pins. The balls are collected into bins at the bottom and settle down into a pattern resembling the Gaussian curve.
The most straightforward method is based on the probability integral transform property: if U is distributed uniformly on (0,1), then Φ⁻¹(U) will have the standard normal distribution. The drawback of this method is that it relies on calculation of the probit function Φ⁻¹, which cannot be done analytically. Some approximate methods are described in Hart (1968) and in the erf article. Wichura[35] gives a fast algorithm for computing this function to 16 decimal places, which is used by R to compute random variates of the normal distribution.
An easy to program approximate approach, that relies on the central limit theorem, is as follows: generate 12 uniform U(0,1) deviates, add them all up, and subtract 6; the resulting random variable will have approximately standard normal distribution. In truth, the distribution will be Irwin–Hall, which is a 12-section eleventh-order polynomial approximation to the normal distribution. This random deviate will have a limited range of (−6, 6).[36]
The Box–Muller method uses two independent random numbers U and V distributed uniformly on (0,1). Then the two random variables X and Y,

X = √(−2 ln U) cos(2πV),   Y = √(−2 ln U) sin(2πV),

will both have the standard normal distribution, and will be independent. This formulation arises because for a bivariate normal random vector (X, Y) the squared norm X² + Y² will have the chi-squared distribution with two degrees of freedom, which is an easily generated exponential random variable corresponding to the quantity −2 ln(U) in these equations; and the angle is distributed uniformly around the circle, chosen by the random variable V.
The Marsaglia polar method is a modification of the Box–Muller method which does not require computation of the functions sin() and cos(). In this method U and V are drawn from the uniform (−1, 1) distribution, and then S = U² + V² is computed. If S is greater or equal to one then the method starts over, otherwise the two quantities

X = U √(−2 ln S / S),   Y = V √(−2 ln S / S)

are returned. Again, X and Y will be independent and standard normally distributed.
The Ratio method[37] is a rejection method. The algorithm proceeds as follows:
Generate two independent uniform deviates U and V;
Compute X = √(8/e) (V − 0.5)/U;
If X² ≤ 5 − 4e^(1/4) U then accept X and terminate the algorithm;

If X² ≥ 4e^(−1.35)/U + 1.4 then reject X and start over from step 1;
If X² ≤ −4 ln U then accept X, otherwise start over the algorithm.
The ziggurat algorithm (Marsaglia & Tsang 2000) is faster than the Box–Muller transform and still exact. In about 97% of all cases it uses only two random numbers, one random integer and one random uniform, one multiplication and an if-test. Only in the 3% of cases where the combination of those two falls outside the "core of the ziggurat" does a kind of rejection sampling using logarithms, exponentials and more uniform random numbers have to be employed.
There is also some investigation into the connection between the fast Hadamard transform and the normal distribution, since the transform employs just addition and subtraction and by the central limit theorem random numbers from almost any distribution will be transformed into the normal distribution. In this regard a series of Hadamard transforms can be combined with random permutations to turn arbitrary data sets into normally distributed data.
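A minimal sketch of two of the generators described above, the Box–Muller transform and the Marsaglia polar method, assuming NumPy (seeds and sample sizes are illustrative):

import numpy as np

rng = np.random.default_rng(5)

def box_muller(n):
    u = rng.uniform(size=n)
    v = rng.uniform(size=n)
    r = np.sqrt(-2.0 * np.log(u))
    return r * np.cos(2 * np.pi * v), r * np.sin(2 * np.pi * v)

def marsaglia_polar():
    while True:
        u = rng.uniform(-1.0, 1.0)
        v = rng.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:                       # reject points outside the unit disc
            factor = np.sqrt(-2.0 * np.log(s) / s)
            return u * factor, v * factor

x, y = box_muller(10_000)
print(x.mean(), x.std())          # should be close to 0 and 1
print(marsaglia_polar())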


Numerical approximations for the normal CDF


The standard normal CDF is widely used in scientific and statistical computing. The values Φ(x) may be approximated very accurately by a variety of methods, such as numerical integration, Taylor series, asymptotic series and continued fractions. Different approximations are used depending on the desired level of accuracy.
Abramowitz & Stegun (1964) give the approximation for Φ(x) for x > 0 with the absolute error |ε(x)| < 7.5·10⁻⁸ (algorithm 26.2.17 [38]):

Φ(x) ≈ 1 − φ(x)·(b1 t + b2 t² + b3 t³ + b4 t⁴ + b5 t⁵),   t = 1/(1 + b0 x),

where φ(x) is the standard normal PDF, and b0 = 0.2316419, b1 = 0.319381530, b2 = −0.356563782, b3 = 1.781477937, b4 = −1.821255978, b5 = 1.330274429.
Hart (1968) lists almost a hundred rational function approximations for the erfc() function. His algorithms vary in the degree of complexity and the resulting precision, with maximum absolute precision of 24 digits. An algorithm by West (2009) combines Hart's algorithm 5666 with a continued fraction approximation in the tail to provide a fast computation algorithm with a 16-digit precision.
W. J. Cody (1969), after recalling that the Hart (1968) solution is not suited for erf, gives a solution for both erf and erfc, with maximal relative error bound, via Rational Chebyshev Approximation (Cody, W. J. (1969). "Rational Chebyshev Approximations for the Error Function").
Marsaglia (2004) suggested a simple algorithm[39] based on the Taylor series expansion

Φ(x) = ½ + φ(x)·( x + x³/3 + x⁵/(3·5) + x⁷/(3·5·7) + ⋯ )

for calculating Φ(x) with arbitrary precision. The drawback of this algorithm is comparatively slow calculation time (for example it takes over 300 iterations to calculate the function with 16 digits of precision when x = 10). The GNU Scientific Library calculates values of the standard normal CDF using Hart's algorithms and approximations with Chebyshev polynomials.
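A minimal sketch of the Abramowitz & Stegun 26.2.17 approximation quoted above (with the negative signs on b2 and b4 restored), checked against scipy.stats.norm.cdf; SciPy is assumed available:

import math
from scipy.stats import norm

B = (0.2316419, 0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def phi_approx(x):
    """Approximate standard normal CDF for x >= 0 (absolute error below 7.5e-8)."""
    t = 1.0 / (1.0 + B[0] * x)
    poly = t * (B[1] + t * (B[2] + t * (B[3] + t * (B[4] + t * B[5]))))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)
    return 1.0 - pdf * poly

for x in (0.5, 1.0, 2.0, 3.0):
    print(x, phi_approx(x), norm.cdf(x))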

History
Development
Some authors[40][41] attribute the credit for the discovery of the normal distribution to de Moivre, who in 1738[42] published in the second edition of his "The Doctrine of Chances" the study of the coefficients in the binomial expansion of (a + b)ⁿ. De Moivre proved that the middle term in this expansion has the approximate magnitude of 2/√(2πn), and that "If m or ½n be a Quantity infinitely great, then the Logarithm of the Ratio, which a Term distant from the middle by the Interval ℓ, has to the middle Term, is −2ℓℓ/n."[43] Although this theorem can be interpreted as

the first obscure expression for the normal probability law, Stigler points out that de Moivre himself did not interpret his results as anything more than the approximate rule for the binomial coefficients, and in particular de Moivre lacked the concept of the probability density function.[44] In 1809 Gauss published his monograph "Theoria motus corporum coelestium in sectionibus conicis solem ambientium" where among other things he introduces several important statistical concepts, such as the method of least squares, the method of maximum likelihood, and the normal distribution. Gauss used M, M′, M′′, … to denote the measurements of some unknown quantity V, and sought the "most probable" estimator: the one which maximizes the probability φ(M − V)·φ(M′ − V)·φ(M′′ − V)·… of obtaining the observed experimental results. In his notation φΔ is the probability law of the measurement errors of magnitude Δ. Not knowing what the function φ is, Gauss requires that his method should reduce to the well-known answer: the arithmetic mean of the measured values.[45] Starting from these principles, Gauss demonstrates that the only law which rationalizes the choice of arithmetic mean as an estimator of the location parameter is the normal law of errors:[46]

φ(Δ) = (h/√π) e^(−h²Δ²),


Carl Friedrich Gauss discovered the normal distribution in 1809 as a way to rationalize the method of least squares.

where h is "the measure of the precision of the observations". Using this normal law as a generic model for errors in the experiments, Gauss formulates what is now known as the non-linear weighted least squares (NWLS) method.[47] Although Gauss was the first to suggest the normal distribution law, Laplace made significant contributions.[48] It was Laplace who first posed the problem of aggregating several observations in 1774,[49] although his own solution led to the Laplacian distribution. It was Laplace who first calculated the value of the integral ∫e^(−t²) dt = √π in 1782, providing the normalization constant for the normal distribution.[50] Finally, it was Laplace who in 1810 proved and presented to the Academy the fundamental central limit theorem, which emphasized the theoretical importance of the normal distribution.[51] It is of interest to note that in 1809 an American mathematician Adrain published two derivations of the normal probability law, simultaneously and independently from Gauss.[52] His works remained largely unnoticed by the scientific community, until in 1871 they were "rediscovered" by Abbe.[53] In the middle of the 19th century Maxwell demonstrated that the normal distribution is not just a convenient mathematical tool, but may also occur in natural phenomena:[54] "The number of particles whose velocity, resolved in a certain direction, lies between x and x + dx is

Marquis de Laplace proved the central limit theorem in 1810, consolidating the importance of the normal distribution in statistics.


Naming
Since its introduction, the normal distribution has been known by many different names: the law of error, the law of facility of errors, Laplace's second law, Gaussian law, etc. Gauss himself apparently coined the term with reference to the "normal equations" involved in its applications, with normal having its technical meaning of orthogonal rather than "usual".[55] However, by the end of the 19th century some authors[56] had started using the name normal distribution, where the word "normal" was used as an adjective, the term now being seen as a reflection of the fact that this distribution was seen as typical, common, and thus "normal". Peirce (one of those authors) once defined "normal" thus: "...the 'normal' is not the average (or any other kind of mean) of what actually occurs, but of what would, in the long run, occur under certain circumstances."[57] Around the turn of the 20th century Pearson popularized the term normal as a designation for this distribution.[58]
Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another 'abnormal'. Pearson (1920)
Also, it was Pearson who first wrote the distribution in terms of the standard deviation σ as in modern notation. Soon after this, in 1915, Fisher added the location parameter to the formula for the normal distribution, expressing it in the way it is written nowadays:

df = (1/√(2σ²π)) e^(−(x − m)²/(2σ²)) dx.

The term "standard normal", which denotes the normal distribution with zero mean and unit variance, came into general use around the 1950s, appearing in the popular textbooks by P.G. Hoel (1947) "Introduction to mathematical statistics" and A.M. Mood (1950) "Introduction to the theory of statistics".[59] When that name is used, the "Gaussian distribution" is named after Carl Friedrich Gauss, who introduced the distribution in 1809 as a way of rationalizing the method of least squares as outlined above. The related work of Laplace, also outlined above, has led to the normal distribution being sometimes called Laplacian, especially in French-speaking countries. Among English speakers, both "normal distribution" and "Gaussian distribution" are in common use, with different terms preferred by different communities.

Notes
[1] The designation "bell curve" is ambiguous: many other distributions are are "bell"-shaped: the Cauchy distribution, Student's t-distribution, generalized normal, logistic, etc. [2] Casella & Berger (2001, p.102) [3] Gale Encyclopedia of Psychology Normal Distribution (http:/ / findarticles. com/ p/ articles/ mi_g2699/ is_0002/ ai_2699000241) [4] Cover, T. M.; Thomas, Joy A (2006). Elements of information theory. John Wiley and Sons. p.254. [5] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu. edu. cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics (Elsevier): 219230. . Retrieved 2011-06-02. [6] Halperin & et al. (1965, item 7) [7] McPherson (1990, p.110) [8] Bernardo & Smith (2000, p.121) [9] Patel & Read (1996, [2.1.4]) [10] Fan (1991, p. 1258) [11] Patel & Read (1996, [2.1.8]) [12] Scott, Clayton; Robert Nowak (August 7, 2003). "The Q-function" (http:/ / cnx. org/ content/ m11537/ 1. 2/ ). Connexions. . [13] Barak, Ohad (April 6, 2006). "Q function and error function" (http:/ / www. eng. tau. ac. il/ ~jo/ academic/ Q. pdf). Tel Aviv University. . [14] Weisstein, Eric W., " Normal Distribution Function (http:/ / mathworld. wolfram. com/ NormalDistributionFunction. html)" from MathWorld. [15] Bryc (1995, p.23) [16] Bryc (1995, p.24)

Normal distribution
[17] WolframAlpha.com (http:/ / www. wolframalpha. com/ input/ ?i=Table[{N(Erf(n/ Sqrt(2)),+ 12),+ N(1-Erf(n/ Sqrt(2)),+ 12),+ N(1/ (1-Erf(n/ Sqrt(2))),+ 12)},+ {n,1,6}]) [18] part 1 (http:/ / www. wolframalpha. com/ input/ ?i=Table[Sqrt(2)*InverseErf(x),+ {x,+ N({8/ 10,+ 9/ 10,+ 19/ 20,+ 49/ 50,+ 99/ 100,+ 995/ 1000,+ 998/ 1000},+ 13)}]), part 2 (http:/ / www. wolframalpha. com/ input/ ?i=Table[{N(1-10^(-x),9),N(Sqrt(2)*InverseErf(1-10^(-x)),13)},{x,3,9}]) [19] Galambos & Simonelli (2004, Theorem3.5) [20] Bryc (1995, p.35) [21] Bryc (1995, p.27) [22] Lukacs & King (1954) [23] Patel & Read (1996, [2.3.6]) [24] http:/ / www. allisons. org/ ll/ MML/ KL/ Normal/ [25] "Stat260: Bayesian Modeling and Inference Lecture Date: February 8th, 2010, The Conjugate Prior for the Normal Distribution, Lecturer: Michael I. Jordan|" (http:/ / www. cs. berkeley. edu/ ~jordan/ courses/ 260-spring10/ lectures/ lecture5. pdf). . [26] Cover & Thomas (2006, p.254) [27] Amari & Nagaoka (2000) [28] Mathworld entry for Normal Product Distribution (http:/ / mathworld. wolfram. com/ NormalProductDistribution. html) [29] Krishnamoorthy (2006, p.127) [30] Krishnamoorthy (2006, p.130) [31] Krishnamoorthy (2006, p.133) [32] Huxley (1932) [33] Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press. pp.592593. [34] Ritzema (ed.), H.P. (1994). Frequency and Regression Analysis (http:/ / www. waterlog. info/ pdf/ freqtxt. pdf). Chapter 6 in: Drainage Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. pp.175224. ISBN90-70754-33-9. . [35] Wichura, M.J. (1988). "Algorithm AS241: The Percentage Points of the Normal Distribution". Applied Statistics (Blackwell Publishing) 37 (3): 477484. doi:10.2307/2347330. JSTOR2347330. [36] Johnson et al. (1995, Equation (26.48)) [37] Kinderman & Monahan (1976) [38] http:/ / www. math. sfu. ca/ ~cbm/ aands/ page_932. htm [39] For example, this algorithm is given in the article Bc programming language. [40] Johnson et al. (1994, page 85) [41] Le Cam (2000, p.74) [42] De Moivre first published his findings in 1733, in a pamphlet "Approximatio ad Summam Terminorum Binomii (a + b)n in Seriem Expansi" that was designated for private circulation only. But it was not until the year 1738 that he made his results publicly available. The original pamphlet was reprinted several times, see for example Walker (1985). [43] De Moivre (1733), Corollary I see Walker (1985, p.77) [44] Stigler (1986, p.76) [45] "It has been customary certainly to regard as an axiom the hypothesis that if any quantity has been determined by several direct observations, made under the same circumstances and with equal care, the arithmetical mean of the observed values affords the most probable value, if not rigorously, yet very nearly at least, so that it is always most safe to adhere to it." Gauss (1809, section 177) [46] Gauss (1809, section 177) [47] Gauss (1809, section 179) [48] "My custom of terming the curve the GaussLaplacian or normal curve saves us from proportioning the merit of discovery between the two great astronomer mathematicians." quote from Pearson (1905, p.189) [49] Laplace (1774, Problem III) [50] Pearson (1905, p.189) [51] Stigler (1986, p.144) [52] Stigler (1978, p.243) [53] Stigler (1978, p.244) [54] Maxwell (1860), p. 23 [55] Jaynes, E J, Probability Theory: The Logic of Science Ch 7 (http:/ / www-biba. inrialpes. fr/ Jaynes/ cc07s. 
pdf) [56] Besides those specifically referenced here, such use is encountered in the works of Peirce, Galton and Lexis approximately around 1875. [57] Peirce, C. S. (c. 1909 MS), Collected Papers v. 6, paragraph 327. [58] Kruskal & Stigler (1997) [59] "Earliest uses (entry STANDARD NORMAL CURVE)" (http:/ / jeff560. tripod. com/ s. html). .


References
Aldrich, John; Miller, Jeff. "Earliest uses of symbols in probability and statistics" (http://jeff560.tripod.com/ stat.html). Aldrich, John; Miller, Jeff. "Earliest known uses of some of the words of mathematics" (http://jeff560.tripod. com/mathword.html). In particular, the entries for "bell-shaped and bell curve" (http://jeff560.tripod.com/b. html), "normal (distribution)" (http://jeff560.tripod.com/n.html), "Gaussian" (http://jeff560.tripod.com/g. html), and "Error, law of error, theory of errors, etc." (http://jeff560.tripod.com/e.html). Amari, Shun-ichi; Nagaoka, Hiroshi (2000). Methods of information geometry. Oxford University Press. ISBN0-8218-0531-2. Bernardo, J. M.; Smith, A.F.M. (2000). Bayesian Theory. Wiley. ISBN0-471-49464-X. Bryc, Wlodzimierz (1995). The normal distribution: characterizations with applications. Springer-Verlag. ISBN0-387-97990-5. Casella, George; Berger, Roger L. (2001). Statistical inference (2nd ed.). Duxbury. ISBN0-534-24312-6. Cover, T. M.; Thomas, Joy A. (2006). Elements of information theory. John Wiley and Sons. de Moivre, Abraham (1738). The Doctrine of Chances. ISBN0-8218-2103-2. Fan, Jianqing (1991). "On the optimal rates of convergence for nonparametric deconvolution problems". The Annals of Statistics 19 (3): 12571272. doi:10.1214/aos/1176348248. JSTOR2241949. Galambos, Janos; Simonelli, Italo (2004). Products of random variables: applications to problems of physics and to arithmetical functions. Marcel Dekker, Inc.. ISBN0-8247-5402-6. Gauss, Carolo Friderico (1809) (in Latin). Theoria motvs corporvm coelestivm in sectionibvs conicis Solem ambientivm [Theory of the motion of the heavenly bodies moving about the Sun in conic sections]. English translation (http://books.google.com/books?id=1TIAAAAAQAAJ). Gould, Stephen Jay (1981). The mismeasure of man (first ed.). W.W. Norton. ISBN0-393-01489-4. Halperin, Max; Hartley, H. O.; Hoel, P. G. (1965). "Recommended standards for statistical symbols and notation. COPSS committee on symbols and notation". The American Statistician 19 (3): 1214. doi:10.2307/2681417. JSTOR2681417. Hart, John F.; et al (1968). Computer approximations. New York: John Wiley & Sons, Inc. ISBN0-88275-642-7. Hazewinkel, Michiel, ed. (2001), "Normal distribution" (http://www.encyclopediaofmath.org/index. php?title=p/n067460), Encyclopedia of Mathematics, Springer, ISBN978-1-55608-010-4 Herrnstein, C.; Murray (1994). The bell curve: intelligence and class structure in American life. Free Press. ISBN0-02-914673-9. Huxley, Julian S. (1932). Problems of relative growth. London. ISBN0-486-61114-0. OCLC476909537. Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1994). Continuous univariate distributions, Volume 1. Wiley. ISBN0-471-58495-9. Johnson, N.L.; Kotz, S.; Balakrishnan, N. (1995). Continuous univariate distributions, Volume 2. Wiley. ISBN0-471-58494-0. Krishnamoorthy, K. (2006). Handbook of statistical distributions with applications. Chapman & Hall/CRC. ISBN1-58488-635-8. Kruskal, William H.; Stigler, Stephen M. (1997). Normative terminology: 'normal' in statistics and elsewhere. Statistics and public policy, edited by Bruce D. Spencer. Oxford University Press. ISBN0-19-852341-6. la Place, M. de (1774). "Mmoire sur la probabilit des causes par les vnemens". 
Mémoires de Mathématique et de Physique, Présentés à l'Académie Royale des Sciences, par divers Savans & lûs dans ses Assemblées, Tome Sixième: 621–656. Translated by S.M. Stigler in Statistical Science 1 (3), 1986: JSTOR 2245476. Laplace, Pierre-Simon (1812). Analytical theory of probabilities.

Normal distribution Lukacs, Eugene; King, Edgar P. (1954). "A property of normal distribution". The Annals of Mathematical Statistics 25 (2): 389394. doi:10.1214/aoms/1177728796. JSTOR2236741. McPherson, G. (1990). Statistics in scientific investigation: its basis, application and interpretation. Springer-Verlag. ISBN0-387-97137-8. Marsaglia, George; Tsang, Wai Wan (2000). "The ziggurat method for generating random variables" (http:// www.jstatsoft.org/v05/i08/paper). Journal of Statistical Software 5 (8). Marsaglia, George (2004). "Evaluating the normal distribution" (http://www.jstatsoft.org/v11/i05/paper). Journal of Statistical Software 11 (4). Maxwell, James Clerk (1860). "V. Illustrations of the dynamical theory of gases. Part I: On the motions and collisions of perfectly elastic spheres". Philosophical Magazine, series 4 19 (124): 1932. doi:10.1080/14786446008642818. Patel, Jagdish K.; Read, Campbell B. (1996). Handbook of the normal distribution (2nd ed.). CRC Press. ISBN0-8247-9342-0. Pearson, Karl (1905). "'Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson'. A rejoinder". Biometrika 4 (1): 169212. JSTOR2331536. Pearson, Karl (1920). "Notes on the history of correlation". Biometrika 13 (1): 2545. doi:10.1093/biomet/13.1.25. JSTOR2331722. Stigler, Stephen M. (1978). "Mathematical statistics in the early states". The Annals of Statistics 6 (2): 239265. doi:10.1214/aos/1176344123. JSTOR2958876. Stigler, Stephen M. (1982). "A modest proposal: a new standard for the normal". The American Statistician 36 (2): 137138. doi:10.2307/2684031. JSTOR2684031. Stigler, Stephen M. (1986). The history of statistics: the measurement of uncertainty before 1900. Harvard University Press. ISBN0-674-40340-1. Stigler, Stephen M. (1999). Statistics on the table. Harvard University Press. ISBN0-674-83601-4. Walker, Helen M (1985). "De Moivre on the law of normal probability" (http://www.york.ac.uk/depts/maths/ histstat/demoivre.pdf). In Smith, David Eugene. A source book in mathematics. Dover. ISBN0-486-64690-4. Weisstein, Eric W. "Normal distribution" (http://mathworld.wolfram.com/NormalDistribution.html). MathWorld. West, Graeme (2009). "Better approximations to cumulative normal functions" (http://www.wilmott.com/pdfs/ 090721_west.pdf). Wilmott Magazine: 7076. Zelen, Marvin; Severo, Norman C. (1964). Probability functions (chapter 26) (http://www.math.sfu.ca/~cbm/ aands/page_931.htm). Handbook of mathematical functions with formulas, graphs, and mathematical tables, by Abramowitz and Stegun: National Bureau of Standards. New York: Dover. ISBN0-486-61272-4.

161

External links
Normal Distribution Video Tutorial Part 1-2 (http://www.youtube.com/watch?v=kB_kYUbS_ig) An 8-foot-tall (2.4m) Probability Machine (named Sir Francis) comparing stock market returns to the randomness of the beans dropping through the quincunx pattern. (http://www.youtube.com/ watch?v=AUSKTk9ENzg) YouTube link originating from Index Funds Advisors (http://www.ifa.com) An interactive Normal (Gaussian) distribution plot (http://peter.freeshell.org/gaussian/)


Student's t-distribution
Student's t
Probability density function (plot)
Cumulative distribution function (plot)
Parameters: ν > 0 degrees of freedom (real)
Support: x ∈ (−∞; +∞)
PDF: Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + x²/ν)^(−(ν+1)/2)
CDF: ½ + x Γ((ν+1)/2) · 2F1(½, (ν+1)/2; 3/2; −x²/ν) / (√(νπ) Γ(ν/2)), where 2F1 is the hypergeometric function
Mean: 0 for ν > 1, otherwise undefined
Median: 0
Mode: 0
Variance: ν/(ν − 2) for ν > 2, ∞ for 1 < ν ≤ 2, otherwise undefined
Skewness: 0 for ν > 3, otherwise undefined
Ex. kurtosis: 6/(ν − 4) for ν > 4, ∞ for 2 < ν ≤ 4, otherwise undefined
Entropy: ((ν+1)/2)·[ψ((1+ν)/2) − ψ(ν/2)] + ln(√ν B(ν/2, ½)), where ψ is the digamma function and B is the beta function
MGF: undefined
CF: K_{ν/2}(√ν |t|) · (√ν |t|)^(ν/2) / (Γ(ν/2) 2^(ν/2 − 1)) for ν > 0,[1] where K_ν(x) is the modified Bessel function of the second kind

In probability and statistics, Student's t-distribution (or simply the t-distribution) is a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. It plays a role in a number of widely used statistical analyses, including the Student's t-test for assessing the statistical significance of the difference between two sample means, the construction of confidence intervals for the difference between two population means, and in linear regression analysis. The Student's t-distribution also arises in the Bayesian analysis of data from a normal family.
The t-distribution is symmetric and bell-shaped, like the normal distribution, but has heavier tails, meaning that it is more prone to producing values that fall far from its mean. This makes it useful for understanding the statistical behavior of certain types of ratios of random quantities, in which variation in the denominator is amplified and may produce outlying values when the denominator of the ratio falls close to zero. The Student's t-distribution is a special case of the generalised hyperbolic distribution.

Definition
Probability density function
Student's t-distribution has the probability density function given by

f(t) = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2),

where ν is the number of degrees of freedom and Γ is the Gamma function. This may also be written as

f(t) = 1 / (√ν B(½, ν/2)) · (1 + t²/ν)^(−(ν+1)/2),

where B is the Beta function. For ν even,

Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) = (ν − 1)(ν − 3)⋯5·3 / (2√ν (ν − 2)(ν − 4)⋯4·2).

For ν odd,

Γ((ν+1)/2) / (√(νπ) Γ(ν/2)) = (ν − 1)(ν − 3)⋯4·2 / (π√ν (ν − 2)(ν − 4)⋯5·3).

The overall shape of the probability density function of the t-distribution resembles the bell shape of a normally distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the number of degrees of freedom grows, the t-distribution approaches the normal distribution with mean 0 and variance 1. The following images show the density of the t-distribution for increasing values of ν. The normal distribution is shown as a blue line for comparison. Note that the t-distribution (red line) becomes closer to the normal distribution as ν increases.
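This convergence is easy to check numerically. A minimal sketch, assuming SciPy is available; the grid and degrees of freedom are illustrative choices:

import numpy as np
from scipy import stats

x = np.linspace(-4, 4, 9)
for nu in (1, 5, 30):
    # maximum gap between the t density and the standard normal density on the grid
    print(nu, np.max(np.abs(stats.t.pdf(x, df=nu) - stats.norm.pdf(x))))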


Density of the t-distribution (red) for 1, 2, 3, 5, 10, and 30 df compared to the standard normal distribution (blue). Previous plots shown in green.


Cumulative distribution function


The cumulative distribution function can be written in terms of I, the regularized incomplete beta function. For t > 0,[2]

F(t) = 1 − ½ I_{x(t)}(ν/2, ½),

with

x(t) = ν / (t² + ν).

Other values would be obtained by symmetry. An alternative formula, valid for t² < ν, is[2]

F(t) = ½ + t Γ((ν+1)/2) · 2F1(½, (ν+1)/2; 3/2; −t²/ν) / (√(νπ) Γ(ν/2)),

where 2F1 is a particular case of the hypergeometric function.


Special cases
Certain values of ν give an especially simple form.
ν = 1 (see Cauchy distribution):
Distribution function: F(t) = ½ + (1/π) arctan(t).
Density function: f(t) = 1/(π(1 + t²)).
ν = 2:
Distribution function: F(t) = ½ (1 + t/√(2 + t²)).
Density function: f(t) = (2 + t²)^(−3/2).
ν = 3:
Density function: f(t) = 6√3 / (π(3 + t²)²).
ν = ∞ (see Normal distribution):
Density function: f(t) = (1/√(2π)) e^(−t²/2).

How the t-distribution arises


Let x1, ..., xn be the numbers observed in a sample from a continuously distributed population with expected value μ. The sample mean and sample variance are respectively

x̄ = (1/n) ∑ᵢ xi,   s² = (1/(n − 1)) ∑ᵢ (xi − x̄)².

The resulting t-value is

t = (x̄ − μ) / (s/√n).

The t-distribution with n − 1 degrees of freedom is the sampling distribution of the t-value when the samples consist of independent identically distributed observations from a normally distributed population. Thus for inference purposes t is a useful "pivotal quantity" in the case when the mean and variance (μ, σ²) are unknown population parameters, in the sense that the t-value has then a probability distribution that depends on neither μ nor σ².
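A minimal sketch simulating this t-value from normal samples and comparing its empirical quantiles with those of the t-distribution with n − 1 degrees of freedom; NumPy and SciPy are assumed, and the parameter values are illustrative:

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu, sigma, n, reps = 5.0, 2.0, 8, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)
s = samples.std(axis=1, ddof=1)
t_values = (xbar - mu) / (s / np.sqrt(n))

for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(t_values, q), stats.t.ppf(q, df=n - 1))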


History and etymology


In statistics, the t-distribution was first derived as a posterior distribution in 1876 by Helmert[3][4][5] and Lüroth.[6][7][8] In the English literature, a derivation of the t-distribution was published in 1908 by William Sealy Gosset[9] while he worked at the Guinness Brewery in Dublin. One version of the origin of the pseudonym Student is that Gosset's employer forbade members of its staff from publishing scientific papers, so he had to hide his identity. Another version is that Guinness did not want their competitors to know that they were using the t-test to test the quality of raw material.[10] The t-test and the associated theory became well known through the work of R.A. Fisher, who called the distribution "Student's distribution".[11][12]

Characterization
As the distribution of a test statistic
Student's t-distribution with ν degrees of freedom can be defined as the distribution of the random variable T with[2][13]

T = Z / √(V/ν),

where Z is normally distributed with expected value 0 and variance 1; V has a chi-squared distribution with ν ("nu") degrees of freedom; and Z and V are independent.
A different distribution is defined as that of the random variable defined, for a given constant μ, by (Z + μ)·√(ν/V). This random variable has a noncentral t-distribution with noncentrality parameter μ. This distribution is important in studies of the power of Student's t-test.
Derivation
Suppose X1, ..., Xn are independent values that are normally distributed with expected value μ and variance σ². Let

X̄ = (1/n)(X1 + ⋯ + Xn)

be the sample mean, and

S² = (1/(n − 1)) ∑ᵢ (Xi − X̄)²

be an unbiased estimate of the variance from the sample. It can be shown that the random variable

V = (n − 1) S²/σ²

has a chi-squared distribution with n − 1 degrees of freedom (by Cochran's theorem).[14] It is readily shown that the quantity

Z = (X̄ − μ) √n / σ

is normally distributed with mean 0 and variance 1, since the sample mean X̄ is normally distributed with mean μ and variance σ²/n. Moreover, it is possible to show that these two random variables (the normally distributed Z and the chi-squared-distributed V) are independent. Consequently the pivotal quantity

T = Z / √(V/(n − 1)) = (X̄ − μ) / (S/√n),

which differs from Z in that the exact standard deviation σ is replaced by the random variable Sn, has a Student's t-distribution as defined above. Notice that the unknown population variance σ² does not appear in T, since it was in both the numerator and the denominator, so it canceled. Gosset intuitively obtained the probability density function stated above, with ν equal to n − 1, and Fisher proved it in 1925.[15] The distribution of the test statistic T depends on ν, but not μ or σ; the lack of dependence on μ and σ is what makes the t-distribution important in both theory and practice.


As a maximum entropy distribution


Student's t-distribution is the maximum entropy probability distribution for a random variate X for which E[ln(ν + X²)] is fixed.[16]

Properties
Moments
The raw moments of the t-distribution are

E[T^k] = 0 for odd k with 0 < k < ν,
E[T^k] = Γ((k+1)/2) Γ((ν−k)/2) ν^{k/2} / (√π Γ(ν/2)) for even k with 0 < k < ν;

moments of order ν or higher do not exist.

The distinction between "undefined" and "defined with the value of infinity" should be kept in mind. This is equivalent to the distinction between the result of 0/0 vs. 1/0. Attempting to evaluate the odd moments in the cases listed above as "undefined" results in an indeterminate form. Because the mean (first raw moment) is undefined when ν = 1 (equivalent to the Cauchy distribution), all of the central moments and standardized moments are likewise undefined, including the variance, skewness and kurtosis. The term for even k with 0 < k < ν may be simplified using the properties of the Gamma function to

E[T^k] = ν^{k/2} ∏_{i=1}^{k/2} (2i − 1)/(ν − 2i).

For a t-distribution with ν degrees of freedom, the expected value is 0 (for ν > 1), and its variance is ν/(ν − 2) if ν > 2. The skewness is 0 if ν > 3 and the excess kurtosis is 6/(ν − 4) if ν > 4.

Relation to F distribution
If T has a Student's t-distribution with ν degrees of freedom, then X = T² has an F-distribution with 1 and ν degrees of freedom: X ~ F(1, ν).

Monte Carlo sampling


There are various approaches to constructing random samples from the Student's t-distribution. The matter depends on whether the samples are required on a stand-alone basis, or are to be constructed by application of a quantile function to uniform samples, e.g. in multi-dimensional applications built on copula dependency. In the case of stand-alone sampling, an extension of the Box-Muller method and its polar form is easily deployed.[17] It has the merit that it applies equally well to all real positive degrees of freedom ν, while many other candidate methods fail if ν is close to zero.[17]
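A simpler alternative, not the Box-Muller extension cited above but a direct use of the definition T = Z/√(V/ν), is sketched below with NumPy; the parameter values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nu, n = 3.5, 100_000            # degrees of freedom (any positive real) and sample size

# T = Z / sqrt(V / nu), with Z ~ N(0, 1), V ~ chi-squared(nu), Z and V independent.
z = rng.standard_normal(n)
v = rng.chisquare(nu, n)
t_samples = z / np.sqrt(v / nu)

print(t_samples.var())          # close to nu / (nu - 2) ≈ 2.33 for nu = 3.5
```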


Integral of Student's probability density function and p-value


The function A(t | ν) is the integral of Student's probability density function, f(t), between −t and t, for t ≥ 0. It thus gives the probability that a value of t less than that calculated from observed data would occur by chance. Therefore, the function A(t | ν) can be used when testing whether the difference between the means of two sets of data is statistically significant, by calculating the corresponding value of t and the probability of its occurrence if the two sets of data were drawn from the same population. This is used in a variety of situations, particularly in t-tests. For the statistic t, with ν degrees of freedom, A(t | ν) is the probability that t would be less than the observed value if the two means were the same (provided that the smaller mean is subtracted from the larger, so that t ≥ 0). It can be easily calculated from the cumulative distribution function Fν(t) of the t-distribution:

A(t | ν) = Fν(t) − Fν(−t) = 1 − I_{ν/(ν+t²)}(ν/2, 1/2),

where I_x(a, b) is the regularized incomplete beta function. For statistical hypothesis testing this function is used to construct the p-value.
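As a sketch (the values of t and ν are chosen only for illustration, and SciPy is assumed), A(t | ν) and the corresponding p-value can be computed either from the CDF or from the regularized incomplete beta function:

```python
import numpy as np
from scipy import stats, special

t, nu = 2.1, 12

# A(t | nu) from the CDF: F(t) - F(-t)
A = stats.t.cdf(t, nu) - stats.t.cdf(-t, nu)

# Equivalently via the regularized incomplete beta function I_x(a, b)
A_beta = 1 - special.betainc(nu / 2, 0.5, nu / (nu + t**2))

p_value = 1 - A                 # two-sided p-value for the observed |t|
print(A, A_beta, p_value)
```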

Non-standardized Student's t-distribution


In terms of standard deviation
Student's t-distribution can be generalized to a three-parameter location-scale family, introducing a location parameter μ and a scale parameter σ. The resulting non-standardized Student's t-distribution has a density defined by[18]

p(x | ν, μ, σ) = Γ((ν+1)/2) / (Γ(ν/2) √(πν) σ) · (1 + (1/ν) ((x − μ)/σ)²)^{−(ν+1)/2}.

Equivalently, it can be written in terms of σ² (corresponding to variance instead of standard deviation):

p(x | ν, μ, σ²) = Γ((ν+1)/2) / (Γ(ν/2) √(πνσ²)) · (1 + (x − μ)²/(νσ²))^{−(ν+1)/2}.

Other properties of this version of the distribution are[18]:

E[X] = μ for ν > 1,
Var[X] = σ² ν/(ν − 2) for ν > 2,
Mode[X] = μ.

This distribution results from compounding a Gaussian distribution (normal distribution) with mean μ and unknown variance, with an inverse gamma distribution placed over the variance with parameters a = ν/2 and b = νσ²/2. In other words, the random variable X is assumed to have a Gaussian distribution with an unknown variance distributed as inverse gamma, and then the variance is marginalized out (integrated out). The reason for the usefulness of this characterization is that the inverse gamma distribution is the conjugate prior distribution of the variance of a Gaussian distribution. As a result, the non-standardized Student's t-distribution arises naturally in many Bayesian inference problems. See below. Equivalently, this distribution results from compounding a Gaussian distribution with a scaled-inverse-chi-squared distribution with parameters ν and σ². The scaled-inverse-chi-squared distribution is exactly the same distribution as the inverse gamma distribution, but with a different parameterization, i.e. ν = 2a, σ² = b/a.
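As an illustrative Monte Carlo check of this compounding construction (a sketch assuming NumPy/SciPy; the parameter values are arbitrary), one can draw a variance from the inverse gamma distribution with the parameters given above, then a normal variate, and compare against direct location-scale t draws:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma, nu, n = 1.0, 2.0, 5.0, 200_000

# Variance ~ Inverse-Gamma(a = nu/2, b = nu*sigma^2/2), then X ~ Normal(mu, variance).
var = stats.invgamma.rvs(a=nu / 2, scale=nu * sigma**2 / 2, size=n, random_state=rng)
x = rng.normal(mu, np.sqrt(var))

# Direct draws from the location-scale t-distribution for comparison
t_direct = mu + sigma * rng.standard_t(nu, n)
print(np.percentile(x, [5, 50, 95]))
print(np.percentile(t_direct, [5, 50, 95]))   # the two sets of quantiles should agree closely
```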


In terms of precision
An alternative parameterization in terms of the precision λ (reciprocal of variance) arises from the relation λ = 1/σ². Then the density is defined by[19]

p(x | ν, μ, λ) = Γ((ν+1)/2) / Γ(ν/2) · (λ/(πν))^{1/2} · (1 + λ(x − μ)²/ν)^{−(ν+1)/2}.

Other properties of this version of the distribution are[19]:

E[X] = μ for ν > 1,
Var[X] = ν / (λ(ν − 2)) for ν > 2,
Mode[X] = μ.

This distribution results from compounding a Gaussian distribution with mean μ and unknown precision (the reciprocal of the variance), with a gamma distribution placed over the precision with parameters a = ν/2 and b = ν/(2λ). In other words, the random variable X is assumed to have a normal distribution with an unknown precision distributed as gamma, and then this is marginalized over the gamma distribution.

Related distributions
Noncentral t-distribution
The noncentral t-distribution is a different way of generalizing the t-distribution to include a location parameter. Unlike the nonstandardized t-distributions, the noncentral distributions are asymmetric (the median is not the same as the mode).

Discrete Student's t-distribution


The "discrete Student's t-distribution" is defined by its probability mass function at r being proportional to[20]

Here a, b, and k are parameters. This distribution arises from the construction of a system of discrete distributions similar to that of the Pearson distributions for continuous distributions.[21]

Uses
In frequentist statistical inference
Student's t-distribution arises in a variety of statistical estimation problems where the goal is to estimate an unknown parameter, such as a mean value, in a setting where the data are observed with additive errors. If (as in nearly all practical statistical work) the population standard deviation of these errors is unknown and has to be estimated from the data, the t-distribution is often used to account for the extra uncertainty that results from this estimation. In most such problems, if the standard deviation of the errors were known, a normal distribution would be used instead of the t-distribution. Confidence intervals and hypothesis tests are two statistical procedures in which the quantiles of the sampling distribution of a particular statistic (e.g. the standard score) are required. In any situation where this statistic is a linear function of the data, divided by the usual estimate of the standard deviation, the resulting quantity can be rescaled and centered to follow Student's t-distribution. Statistical analyses involving means, weighted means, and regression coefficients all lead to statistics having this form.

Quite often, textbook problems will treat the population standard deviation as if it were known and thereby avoid the need to use the Student's t-distribution. These problems are generally of two kinds: (1) those in which the sample size is so large that one may treat a data-based estimate of the variance as if it were certain, and (2) those that illustrate mathematical reasoning, in which the problem of estimating the standard deviation is temporarily ignored because that is not the point that the author or instructor is then explaining.

Hypothesis testing

A number of statistics can be shown to have t-distributions for samples of moderate size under null hypotheses that are of interest, so that the t-distribution forms the basis for significance tests. For example, the distribution of Spearman's rank correlation coefficient ρ, in the null case (zero correlation), is well approximated by the t-distribution for sample sizes above about 20.

Confidence intervals

Suppose the number A is so chosen that

Pr(−A < T < A) = 0.9


when T has a t-distribution with n − 1 degrees of freedom. By symmetry, this is the same as saying that A satisfies

Pr(T < A) = 0.95,

so A is the "95th percentile" of this probability distribution, or A = t_{(0.05, n−1)}. Then

Pr(−A < (X̄ₙ − μ) √n / Sₙ < A) = 0.9,

and this is equivalent to

Pr(X̄ₙ − A Sₙ/√n < μ < X̄ₙ + A Sₙ/√n) = 0.9.

Therefore the interval whose endpoints are

X̄ₙ ± A Sₙ/√n

is a 90-percent confidence interval for μ. Therefore, if we find the mean of a set of observations that we can reasonably expect to have a normal distribution, we can use the t-distribution to examine whether the confidence limits on that mean include some theoretically predicted value, such as the value predicted on a null hypothesis. It is this result that is used in the Student's t-tests: since the difference between the means of samples from two normal distributions is itself distributed normally, the t-distribution can be used to examine whether that difference can reasonably be supposed to be zero. If the data are normally distributed, the one-sided (1 − α) upper confidence limit (UCL) of the mean can be calculated using the following equation:

UCL_{1−α} = X̄ₙ + t_{α, n−1} Sₙ/√n.

The resulting UCL will be the greatest average value that will occur for a given confidence interval and population size. In other words, X̄ₙ being the mean of the set of observations, the probability that the mean of the distribution is less than UCL_{1−α} is equal to the confidence level 1 − α.
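A short Python sketch of these one-sided limits (the data are invented for the example, and SciPy is assumed):

```python
import numpy as np
from scipy import stats

x = np.array([9.2, 10.1, 9.8, 10.4, 9.9, 10.2, 9.6, 10.0])   # hypothetical observations
alpha = 0.05                                                  # 95% one-sided confidence

n, xbar, s = len(x), x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - alpha, df=n - 1)
ucl = xbar + t_crit * s / np.sqrt(n)      # one-sided upper confidence limit
lcl = xbar - t_crit * s / np.sqrt(n)      # one-sided lower confidence limit
print(ucl, lcl)
```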

Prediction intervals

The t-distribution can be used to construct a prediction interval for an unobserved sample from a normal distribution with unknown mean and variance.


In Bayesian statistics
The Student's t-distribution, especially in its three-parameter (location-scale) version, arises frequently in Bayesian statistics as a result of its connection with the normal distribution. Whenever the variance of a normally distributed random variable is unknown and a conjugate prior that follows an inverse gamma distribution is placed over it, the resulting marginal distribution of the variable will follow a Student's t-distribution. Equivalent constructions with the same results involve a conjugate scaled-inverse-chi-squared distribution over the variance, or a conjugate gamma distribution over the precision. If an improper prior proportional to 1/σ² is placed over the variance, the t-distribution also arises. This is the case regardless of whether the mean of the normally distributed variable is known, is unknown distributed according to a conjugate normally distributed prior, or is unknown distributed according to an improper constant prior. Related situations that also produce a t-distribution are: the marginal posterior distribution of the unknown mean of a normally distributed variable, with unknown prior mean and variance following the above model; and the prior predictive distribution and posterior predictive distribution of a new normally distributed data point when a series of independent identically distributed normally distributed data points have been observed, with prior mean and variance as in the above model.

Robust parametric modeling


The t-distribution is often used as an alternative to the normal distribution as a model for data.[22] It is frequently the case that real data have heavier tails than the normal distribution allows for. The classical approach was to identify outliers and exclude or downweight them in some way. However, it is not always easy to identify outliers (especially in high dimensions), and the t-distribution is a natural choice of model for such data and provides a parametric approach to robust statistics. Lange et al. explored the use of the t-distribution for robust modeling of heavy tailed data in a variety of contexts. A Bayesian account can be found in Gelman et al. The degrees of freedom parameter controls the kurtosis of the distribution and is correlated with the scale parameter. The likelihood can have multiple local maxima and, as such, it is often necessary to fix the degrees of freedom at a fairly low value and estimate the other parameters taking this as given. Some authors report that values between 3 and 9 are often good choices. Venables and Ripley suggest that a value of 5 is often a good choice.

Table of selected values


Most statistical textbooks list t-distribution tables. Nowadays, the better way to obtain a fully precise critical t value or a cumulative probability is to use the statistical functions implemented in spreadsheets (Office Excel, OpenOffice Calc, etc.) or an interactive calculating web page. The relevant spreadsheet functions are TDIST and TINV, while online calculating pages avoid problems such as the ordering of parameters or the names of the functions. For example, a MediaWiki page supported by the R extension can easily give the interactive result [23] of critical values or cumulative probability, even for the noncentral t-distribution. The following table lists a few selected values for t-distributions with ν degrees of freedom for a range of one-sided or two-sided critical regions. For an example of how to read this table, take the fourth row, which begins with 4; that means ν, the number of degrees of freedom, is 4 (and if we are dealing, as above, with n values with a fixed sum, n = 5). Take the fifth entry, in the column headed 95% for one-sided (90% for two-sided). The value of that entry is

Student's t-distribution "2.132". Then the probability that T is less than 2.132 is 95% or Pr(<T<2.132)=0.95; or mean that Pr(2.132<T<2.132)=0.9. This can be calculated by the symmetry of the distribution, Pr(T<2.132)=1Pr(T>2.132) = 10.95 = 0.05, and so Pr(2.132<T<2.132) = 12(0.05) = 0.9. Note that the last row also gives critical points: a t-distribution with infinitely many degrees of freedom is a normal distribution. (See Related distributions above). The first column is the number of degrees of freedom.
One-sided  75%    80%    85%    90%    95%    97.5%  99%    99.5%  99.75% 99.9%  99.95%
Two-sided  50%    60%    70%    80%    90%    95%    98%    99%    99.5%  99.8%  99.9%
  1        1.000  1.376  1.963  3.078  6.314  12.71  31.82  63.66  127.3  318.3  636.6
  2        0.816  1.061  1.386  1.886  2.920  4.303  6.965  9.925  14.09  22.33  31.60
  3        0.765  0.978  1.250  1.638  2.353  3.182  4.541  5.841  7.453  10.21  12.92
  4        0.741  0.941  1.190  1.533  2.132  2.776  3.747  4.604  5.598  7.173  8.610
  5        0.727  0.920  1.156  1.476  2.015  2.571  3.365  4.032  4.773  5.893  6.869
  6        0.718  0.906  1.134  1.440  1.943  2.447  3.143  3.707  4.317  5.208  5.959
  7        0.711  0.896  1.119  1.415  1.895  2.365  2.998  3.499  4.029  4.785  5.408
  8        0.706  0.889  1.108  1.397  1.860  2.306  2.896  3.355  3.833  4.501  5.041
  9        0.703  0.883  1.100  1.383  1.833  2.262  2.821  3.250  3.690  4.297  4.781
 10        0.700  0.879  1.093  1.372  1.812  2.228  2.764  3.169  3.581  4.144  4.587
 11        0.697  0.876  1.088  1.363  1.796  2.201  2.718  3.106  3.497  4.025  4.437
 12        0.695  0.873  1.083  1.356  1.782  2.179  2.681  3.055  3.428  3.930  4.318
 13        0.694  0.870  1.079  1.350  1.771  2.160  2.650  3.012  3.372  3.852  4.221
 14        0.692  0.868  1.076  1.345  1.761  2.145  2.624  2.977  3.326  3.787  4.140
 15        0.691  0.866  1.074  1.341  1.753  2.131  2.602  2.947  3.286  3.733  4.073
 16        0.690  0.865  1.071  1.337  1.746  2.120  2.583  2.921  3.252  3.686  4.015
 17        0.689  0.863  1.069  1.333  1.740  2.110  2.567  2.898  3.222  3.646  3.965
 18        0.688  0.862  1.067  1.330  1.734  2.101  2.552  2.878  3.197  3.610  3.922
 19        0.688  0.861  1.066  1.328  1.729  2.093  2.539  2.861  3.174  3.579  3.883
 20        0.687  0.860  1.064  1.325  1.725  2.086  2.528  2.845  3.153  3.552  3.850
 21        0.686  0.859  1.063  1.323  1.721  2.080  2.518  2.831  3.135  3.527  3.819
 22        0.686  0.858  1.061  1.321  1.717  2.074  2.508  2.819  3.119  3.505  3.792
 23        0.685  0.858  1.060  1.319  1.714  2.069  2.500  2.807  3.104  3.485  3.767
 24        0.685  0.857  1.059  1.318  1.711  2.064  2.492  2.797  3.091  3.467  3.745
 25        0.684  0.856  1.058  1.316  1.708  2.060  2.485  2.787  3.078  3.450  3.725
 26        0.684  0.856  1.058  1.315  1.706  2.056  2.479  2.779  3.067  3.435  3.707
 27        0.684  0.855  1.057  1.314  1.703  2.052  2.473  2.771  3.057  3.421  3.690
 28        0.683  0.855  1.056  1.313  1.701  2.048  2.467  2.763  3.047  3.408  3.674
 29        0.683  0.854  1.055  1.311  1.699  2.045  2.462  2.756  3.038  3.396  3.659
 30        0.683  0.854  1.055  1.310  1.697  2.042  2.457  2.750  3.030  3.385  3.646
 40        0.681  0.851  1.050  1.303  1.684  2.021  2.423  2.704  2.971  3.307  3.551
 50        0.679  0.849  1.047  1.299  1.676  2.009  2.403  2.678  2.937  3.261  3.496
 60        0.679  0.848  1.045  1.296  1.671  2.000  2.390  2.660  2.915  3.232  3.460
 80        0.678  0.846  1.043  1.292  1.664  1.990  2.374  2.639  2.887  3.195  3.416
100        0.677  0.845  1.042  1.290  1.660  1.984  2.364  2.626  2.871  3.174  3.390
120        0.677  0.845  1.041  1.289  1.658  1.980  2.358  2.617  2.860  3.160  3.373
∞          0.674  0.842  1.036  1.282  1.645  1.960  2.326  2.576  2.807  3.090  3.291

The number at the beginning of each row in the table above is ν, which has been defined above as n − 1. The percentage along the top is 100%(1 − α). The numbers in the main body of the table are t_{α,ν}. If a quantity T is distributed as a Student's t-distribution with ν degrees of freedom, then there is a probability 1 − α that T will be less than t_{α,ν}. (Calculated as for a one-tailed or one-sided test, as opposed to a two-tailed test.) For example, given a sample with a sample variance of 2 and a sample mean of 10, taken from a sample set of 11 (10 degrees of freedom), using the formula

X̄ₙ ± t_{α,ν} √(s²/n),

we can determine that at 90% confidence, we have a true mean lying below

10 + 1.372 √(2/11) = 10.585.

(In other words, on average, 90% of the times that an upper threshold is calculated by this method, this upper threshold exceeds the true mean.) And, still at 90% confidence, we have a true mean lying over

10 − 1.372 √(2/11) = 9.415.

(In other words, on average, 90% of the times that a lower threshold is calculated by this method, this lower threshold lies below the true mean.) So that at 80% confidence (calculated from 1 − 2(1 − 90%) = 80%), we have a true mean lying within the interval

(10 − 1.372 √(2/11), 10 + 1.372 √(2/11)) = (9.415, 10.585).

This is generally expressed in interval notation, e.g., for this case, at 80% confidence the true mean is within the interval [9.41490, 10.58510]. (In other words, on average, 80% of the times that upper and lower thresholds are calculated by this method, the true mean is both below the upper threshold and above the lower threshold. This is not the same thing as saying that there is an 80% probability that the true mean lies between a particular pair of upper and lower thresholds that have been calculated by this method; see confidence interval and prosecutor's fallacy.) For information on the inverse cumulative distribution function see quantile function.
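The worked example above can be reproduced numerically; this sketch (assuming SciPy) recomputes the critical value and interval endpoints instead of reading them from the table.

```python
import math
from scipy import stats

n, xbar, s2 = 11, 10.0, 2.0
nu = n - 1                              # 10 degrees of freedom

t_90 = stats.t.ppf(0.90, nu)            # one-sided 90% critical value, about 1.372
half_width = t_90 * math.sqrt(s2 / n)

print(xbar - half_width, xbar + half_width)   # about 9.415 and 10.585, as in the text
```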


Notes
[1] Hurst, Simon, The Characteristic Function of the Student-t Distribution (http:/ / wwwmaths. anu. edu. au/ research. reports/ srr/ 95/ 044/ ), Financial Mathematics Research Report No. FMRR006-95, Statistics Research Report No. SRR044-95 [2] Johnson, N.L., Kotz, S., Balakrishnan, N. (1995) Continuous Univariate Distributions, Volume 2, 2nd Edition. Wiley, ISBN 0-471-58494-0 (Chapter 28) [3] Helmert, F. R. (1875). "ber die Bestimmung des wahrscheinlichen Fehlers aus einer endlichen Anzahl wahrer Beobachtungsfehler". Z. Math. Phys., 20, 300-3. [4] Helmert, F. R. (1876a). "ber die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und uber einige damit in Zusammenhang stehende Fragen". Z. Math. Phys., 21, 192-218. [5] Helmert, F. R. (1876b). "Die Genauigkeit der Formel von Peters zur Berechnung des wahrscheinlichen Beobachtungsfehlers director Beobachtungen gleicher Genauigkeit", Astron. Nachr., 88, 113-32. [6] L roth, J (1876). "Vergleichung von zwei Werten des wahrscheinlichen Fehlers". Astron. Nachr. 87 (14): 20920. Bibcode1876AN.....87..209L. doi:10.1002/asna.18760871402. [7] Pfanzagl, J.; Sheynin, O. (1996). "A forerunner of the t-distribution (Studies in the history of probability and statistics XLIV)" (http:/ / biomet. oxfordjournals. org/ cgi/ content/ abstract/ 83/ 4/ 891). Biometrika 83 (4): 891898. doi:10.1093/biomet/83.4.891. MR1766040. . [8] Sheynin, O (1995). "Helmert's work in the theory of errors". Arch. Hist. Ex. Sci. 49: 73104. doi:10.1007/BF00374700. [9] Student [William Sealy Gosset] (March 1908). "The probable error of a mean" (http:/ / www. york. ac. uk/ depts/ maths/ histstat/ student. pdf). Biometrika 6 (1): 125. doi:10.1093/biomet/6.1.1. . [10] Mortimer, Robert G. (2005) Mathematics for Physical Chemistry,Academic Press. 3 edition. ISBN 0-12-508347-5 (page 326) [11] Fisher, R. A. (1925). "Applications of "Student's" distribution" (http:/ / digital. library. adelaide. edu. au/ coll/ special/ fisher/ 43. pdf). Metron 5: 90104. . [12] Walpole, Ronald; Myers, Raymond; Myers, Sharon; Ye, Keying. (2002) Probability and Statistics for Engineers and Scientists. Pearson Education, 7th edition, pg. 237 ISBN 81-7758-404-9 [13] Hogg & Craig (1978, Sections 4.4 and 4.8.) [14] Cochran, W. G. (April 1934). "The distribution of quadratic forms in a normal system, with applications to the analysis of covariance". Mathematical Proceedings of the Cambridge Philosophical Society 30 (2): 178191. Bibcode1934PCPS...30..178C. doi:10.1017/S0305004100016595. [15] Fisher, R. A. (1925). "Applications of "Student's" distribution" (http:/ / digital. library. adelaide. edu. au/ coll/ special/ fisher/ 43. pdf). Metron 5: 90104. . [16] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu. edu. cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics (Elsevier): 219230. . Retrieved 2011-06-02. [17] Bailey, R. W. (1994). "Polar Generation of Random Variates with the t-Distribution". Mathematics of Computation 62 (206): 779781. doi:10.2307/2153537. [18] Jackman, Simon (2009). Bayesian Analysis for the Social Sciences. Wiley. [19] Bishop, C.M. (2006). Pattern recognition and machine learning. Springer. [20] Ord, J.K. (1972) Families of Frequency Distributions, Griffin. ISBN 0-85264-137-0 (Table 5.1) [21] Ord, J.K. (1972) Families of Frequency Distributions, Griffin. 
ISBN 0-85264-137-0 (Chapter 5) [22] Lange, Kenneth L.; Little, Roderick J.A.; Taylor, Jeremy M.G. (1989). "Robust statistical modeling using the t-distribution". JASA 84 (408): 881896. JSTOR2290063. [23] http:/ / mars. wiwi. hu-berlin. de/ mediawiki/ slides/ index. php/ Comparison_of_noncentral_and_central_distributions

References
Senn, S.; Richardson, W. (1994). "The first t-test". Statistics in Medicine 13 (8): 785-803. doi:10.1002/sim.4780130802. PMID 8047737.
Hogg, R.V.; Craig, A.T. (1978). Introduction to Mathematical Statistics. New York: Macmillan.
Venables, W.N.; Ripley, B.D. (2002). Modern Applied Statistics with S, Fourth Edition. Springer.
Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Rubin, Donald B. (2003). Bayesian Data Analysis (Second Edition) (http://www.stat.columbia.edu/~gelman/book/). CRC/Chapman & Hall. ISBN 1-58488-388-X.


External links
Earliest Known Uses of Some of the Words of Mathematics (S) (http://jeff560.tripod.com/s.html) (Remarks on the history of the term "Student's distribution")


Continuous Distributions on [0,Inf)


Gamma distribution
Gamma

Probability density function and cumulative distribution function (plots)

Parameters: shape k > 0 and scale θ > 0, or equivalently shape α = k > 0 and rate β = 1/θ > 0
Support: x ∈ (0, ∞)
Probability density function (pdf): x^{k−1} e^{−x/θ} / (Γ(k) θ^k)
Cumulative distribution function (cdf): γ(k, x/θ) / Γ(k)
Mean: kθ
Median: no simple closed form
Mode: (k − 1)θ for k ≥ 1
Variance: kθ²
Skewness: 2/√k
Excess kurtosis: 6/k
Entropy: k + ln θ + ln Γ(k) + (1 − k) ψ(k)
Moment-generating function (mgf): (1 − θt)^{−k} for t < 1/θ
Characteristic function: (1 − θit)^{−k}

In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. There are two different parameterizations in common use: 1. With a shape parameter k and a scale parameter θ. 2. With a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter. The parameterization with k and θ appears to be more common in econometrics and certain other applied fields, where e.g. the gamma distribution is frequently used to model waiting times. For instance, in life testing, the waiting time until death is a random variable that is frequently modeled with a gamma distribution.[1] The parameterization with α and β is more common in Bayesian statistics, where the gamma distribution is used as a conjugate prior distribution for various types of inverse scale (aka rate) parameters, such as the λ of an exponential distribution or a Poisson distribution, or for that matter, the β of the gamma distribution itself. (The closely related inverse gamma distribution is used as a conjugate prior for scale parameters, such as the variance of a normal distribution.) If k is an integer, then the distribution represents an Erlang distribution; i.e., the sum of k independent exponentially distributed random variables, each of which has a mean of θ (which is equivalent to a rate parameter of 1/θ). Equivalently, if α is an integer, then the distribution again represents an Erlang distribution, i.e. the sum of α independent exponentially distributed random variables, each of which has a mean of 1/β (which is equivalent to a rate parameter of β). The gamma distribution is the maximum entropy probability distribution for a random variable X for which E[X] = kθ = α/β is fixed and greater than zero, and E[ln X] = ψ(k) + ln θ = ψ(α) − ln β is fixed (ψ is the digamma function).[2]


Characterization using shape k and scale


A random variable X that is gamma-distributed with shape k and scale θ is denoted X ~ Γ(k, θ) ≡ Gamma(k, θ).

Probability density function


The probability density function of the gamma distribution can be expressed in terms of the gamma function, parameterized in terms of a shape parameter k and scale parameter θ. Both k and θ will be positive values. The equation defining the probability density function of a gamma-distributed random variable x is

f(x; k, θ) = x^{k−1} e^{−x/θ} / (Γ(k) θ^k) for x > 0 and k, θ > 0.

Illustration of the Gamma PDF for parameter values over k and x with θ set to 1, 2, 3, 4, 5 and 6. One can see each layer by itself here [3] as well as by k [4] and x [5].

(This parameterization is used in the infobox and the plots.)

Cumulative distribution function


The cumulative distribution function is the regularized gamma function:

F(x; k, θ) = γ(k, x/θ) / Γ(k),

where γ(k, x/θ) is the lower incomplete gamma function.

It can also be expressed as follows, if k is a positive integer (i.e., the distribution is an Erlang distribution):[6]

F(x; k, θ) = 1 − Σ_{i=0}^{k−1} (1/i!) (x/θ)^i e^{−x/θ}.
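A minimal Python sketch (assuming SciPy; the values of k, θ and x are arbitrary) evaluates this CDF through the regularized lower incomplete gamma function and, for integer k, through the Erlang sum:

```python
import math
from scipy import special, stats

k, theta, x = 3.0, 2.0, 4.5

# Regularized lower incomplete gamma function P(k, x / theta)
cdf = special.gammainc(k, x / theta)

# For integer k, the Erlang form gives the same value
erlang = 1 - sum((x / theta)**i * math.exp(-x / theta) / math.factorial(i)
                 for i in range(int(k)))

# Cross-check against the library implementation of the gamma CDF
print(cdf, erlang, stats.gamma.cdf(x, a=k, scale=theta))
```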


Characterization using shape and rate


Alternatively, the gamma distribution can be parameterized in terms of a shape parameter α = k and an inverse scale parameter β = 1/θ, called a rate parameter:

g(x; α, β) = β^α x^{α−1} e^{−βx} / Γ(α) for x > 0.

If α is a positive integer, then Γ(α) = (α − 1)!.

A random variable X that is gamma-distributed with shape α and rate β is denoted X ~ Γ(α, β) ≡ Gamma(α, β).

Both parametrizations are common because either can be more convenient depending on the situation.

Cumulative distribution function


The cumulative distribution function is the regularized gamma function:

F(x; α, β) = γ(α, βx) / Γ(α),

where γ(α, βx) is the lower incomplete gamma function.

It can also be expressed as follows, if α is a positive integer (i.e., the distribution is an Erlang distribution):[6]

F(x; α, β) = 1 − Σ_{i=0}^{α−1} (1/i!) (βx)^i e^{−βx}.

Properties
Skewness
The skewness depends only on the first parameter (k). It approaches a normal distribution when k is large (approximately when k > 10).

Median calculation
Unlike the mode and the mean, which have readily calculable formulas based on the parameters, the median does not have an easy closed-form equation. The median for this distribution is defined as the constant x₀ such that

∫₀^{x₀} f(x; k, θ) dx = 1/2.

The ease of this calculation depends on the k parameter, and it is best achieved by a computer since the calculations can quickly grow out of control. For the Γ(n + 1, 1) distribution the median (ν) is known[7] to lie between n and n + 1; this estimate has been improved.[8] A method of estimating the median for any gamma distribution has been derived based on the ratio μ/(μ − ν), which to a very good approximation is a linear function of the shape parameter when k ≥ 1.[9] The median estimated by this method is approximately


where μ is the mean.

Summation
If Xᵢ has a Γ(kᵢ, θ) distribution for i = 1, 2, ..., N (i.e., all distributions have the same scale parameter θ), then

Σ_{i=1}^{N} Xᵢ ~ Γ(Σ_{i=1}^{N} kᵢ, θ),

provided all Xᵢ are independent. For the cases where the Xᵢ are independent but have different scale parameters see Mathai (1982) and Moschopoulos (1984). The gamma distribution exhibits infinite divisibility.

Scaling
If X ~ Γ(k, θ), then for any c > 0,

cX ~ Γ(k, cθ).

Hence the use of the term "scale parameter" to describe θ. Equivalently, if X ~ Γ(α, β), then for any c > 0,

cX ~ Γ(α, β/c).

Hence the use of the term "inverse scale parameter" to describe β.

Exponential family
The gamma distribution is a two-parameter exponential family with natural parameters k − 1 and −1/θ (equivalently, α − 1 and −β), and natural statistics X and ln(X). If the shape parameter k is held fixed, the resulting one-parameter family of distributions is a natural exponential family.

Logarithmic expectation
One can show that

E[ln X] = ψ(α) − ln β,

or equivalently,

E[ln X] = ψ(k) + ln θ,

where ψ(α) or ψ(k) is the digamma function. This can be derived using the exponential family formula for the moment generating function of the sufficient statistic, because one of the sufficient statistics of the gamma distribution is ln(x).


Information entropy
The information entropy can be derived as

H(X) = α − ln β + ln Γ(α) + (1 − α) ψ(α).

In the k, θ parameterization, the information entropy is given by

H(X) = k + ln θ + ln Γ(k) + (1 − k) ψ(k).

KullbackLeibler divergence
The Kullback-Leibler divergence (KL-divergence), as with the information entropy and various other theoretical properties, is more commonly seen using the α, β parameterization because of its use in Bayesian and other theoretical statistics frameworks. The KL-divergence of Γ(αp, βp) ("true" distribution) from Γ(αq, βq) ("approximating" distribution) is given by[10]

D_KL(αp, βp; αq, βq) = (αp − αq) ψ(αp) − ln Γ(αp) + ln Γ(αq) + αq (ln βp − ln βq) + αp (βq − βp)/βp.

Illustration of the Kullback-Leibler (KL) divergence for two Gamma PDFs, for several parameter settings; the typical asymmetry of the KL divergence is clearly visible.

Written using the k, θ parameterization, the KL-divergence of Γ(kp, θp) from Γ(kq, θq) is obtained by substituting α = k and β = 1/θ:

D_KL(kp, θp; kq, θq) = (kp − kq) ψ(kp) − ln Γ(kp) + ln Γ(kq) + kq (ln θq − ln θp) + kp (θp − θq)/θq.


Laplace transform
The Laplace transform of the gamma PDF is

F(s) = (1 + θs)^{−k} = β^α / (s + β)^α.

Parameter estimation
Maximum likelihood estimation
The likelihood function for N iid observations (x₁, ..., x_N) is

L(k, θ) = ∏_{i=1}^{N} f(xᵢ; k, θ),

from which we calculate the log-likelihood function

ℓ(k, θ) = (k − 1) Σᵢ ln xᵢ − Σᵢ xᵢ/θ − Nk ln θ − N ln Γ(k).

Finding the maximum with respect to θ by taking the derivative and setting it equal to zero yields the maximum likelihood estimator of the θ parameter:

θ̂ = (1/(kN)) Σᵢ xᵢ.

Substituting this into the log-likelihood function gives

ℓ = (k − 1) Σᵢ ln xᵢ − Nk − Nk ln(Σᵢ xᵢ / (kN)) − N ln Γ(k).

Finding the maximum with respect to k by taking the derivative and setting it equal to zero yields

ln k − ψ(k) = ln((1/N) Σᵢ xᵢ) − (1/N) Σᵢ ln xᵢ,

where ψ(k) is the digamma function. There is no closed-form solution for k. The function is numerically very well behaved, so if a numerical solution is desired, it can be found using, for example, Newton's method. An initial value of k can be found either using the method of moments, or using the approximation

ln k − ψ(k) ≈ (1/(2k)) (1 + 1/(6k + 1)).

If we let

s = ln((1/N) Σᵢ xᵢ) − (1/N) Σᵢ ln xᵢ,

then k is approximately

k ≈ (3 − s + √((s − 3)² + 24s)) / (12s),

which is within 1.5% of the correct value. An explicit form for the Newton-Raphson update of this initial guess is given by Choi and Wette (1969) as the following expression:

k ← k − (ln k − ψ(k) − s) / (1/k − ψ′(k)),


where ψ′(k) denotes the trigamma function (the derivative of the digamma function).

The digamma and trigamma functions can be difficult to calculate with high precision. However, approximations known to be good to several significant figures can be computed using the following approximation formulae:

ψ(k) ≈ ln(k) − 1/(2k) − 1/(12k²)

and

ψ′(k) ≈ 1/k + 1/(2k²) + 1/(6k³).

For details, see Choi and Wette (1969).
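The initial approximation and the Newton-Raphson refinement described above can be put together in a short routine; the following Python sketch (assuming NumPy/SciPy, with an invented test sample) is an illustration rather than the reference implementation of Choi and Wette.

```python
import numpy as np
from scipy.special import digamma, polygamma

def gamma_mle(x, tol=1e-10, max_iter=100):
    """Maximum likelihood estimates (k, theta) for i.i.d. gamma data."""
    x = np.asarray(x, dtype=float)
    s = np.log(x.mean()) - np.log(x).mean()               # s = ln(mean) - mean(ln x)
    k = (3 - s + np.sqrt((s - 3)**2 + 24 * s)) / (12 * s)  # initial approximation
    for _ in range(max_iter):                              # Newton-Raphson refinement
        step = (np.log(k) - digamma(k) - s) / (1 / k - polygamma(1, k))
        k -= step
        if abs(step) < tol:
            break
    theta = x.mean() / k
    return k, theta

rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.5, scale=1.7, size=5000)
print(gamma_mle(sample))                                   # close to (2.5, 1.7)
```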

Bayesian minimum mean-squared error


With known k and unknown θ, the posterior PDF for θ (using the standard scale-invariant prior p(θ) ∝ 1/θ) is

P(θ | k, x₁, ..., x_N) ∝ θ^{−Nk−1} exp(−y/θ).

Denoting

y ≡ Σ_{i=1}^{N} xᵢ,

integration over θ can be carried out using a change of variables, revealing that 1/θ is gamma-distributed with parameters α = Nk and β = y.

The moments can be computed by taking the ratio (the integral with exponent m divided by the integral with m = 0):

E[θ^m] = Γ(Nk − m) / Γ(Nk) · y^m,

which shows that the mean ± standard deviation estimate of the posterior distribution for θ is

y/(Nk − 1) ± y / ((Nk − 1) √(Nk − 2)).

Generating gamma-distributed random variables


Given the scaling property above, it is enough to generate gamma variables with θ = 1, as we can later convert to any value of θ with simple division. Using the fact that a Γ(1, 1) distribution is the same as an Exp(1) distribution, and noting the method of generating exponential variables, we conclude that if U is uniformly distributed on (0, 1], then −ln(U) is distributed Γ(1, 1).

Now, using the "α-addition" property of the gamma distribution, we expand this result:

−Σ_{k=1}^{n} ln U_k ~ Γ(n, 1),

where the U_k are all uniformly distributed on (0, 1] and independent. All that is left now is to generate a variable distributed as Γ(δ, 1) for 0 < δ < 1 and apply the "α-addition" property once more. This is the most difficult part.

Random generation of gamma variates is discussed in detail by Devroye,[11] noting that none are uniformly fast for all shape parameters. For small values of the shape parameter, the algorithms are often not valid.[12] For arbitrary values of the shape parameter, one can apply the Ahrens and Dieter[13] modified acceptance-rejection method Algorithm GD (shape k ≥ 1), or the transformation method[14] when 0 < k < 1. Also see Cheng and Feast Algorithm GKM 3[15] or Marsaglia's squeeze method.[16]

The following is a version of the Ahrens-Dieter acceptance-rejection method:[13]

1. Let m be 1.
2. Generate V_{3m−2}, V_{3m−1} and V_{3m} as independent uniformly distributed on (0, 1] variables.
3. If V_{3m−2} ≤ v₀, where v₀ = e/(e + δ), then go to step 4, else go to step 5.
4. Let ξ_m = V_{3m−1}^{1/δ}, η_m = V_{3m} ξ_m^{δ−1}. Go to step 6.
5. Let ξ_m = 1 − ln V_{3m−1}, η_m = V_{3m} e^{−ξ_m}.
6. If η_m > ξ_m^{δ−1} e^{−ξ_m}, then increment m and go to step 2.
7. Assume ξ = ξ_m to be the realization of Γ(δ, 1).


A summary of this is

θ (ξ − Σ_{i=1}^{⌊k⌋} ln Uᵢ) ~ Γ(k, θ),

where ⌊k⌋ is the integral part of k, ξ has been generated using the algorithm above with δ = {k} (the fractional part of k), and the Uᵢ and V_l are distributed as explained above and are all independent.
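The following Python sketch follows this scheme: exponentials for the integer part and an acceptance-rejection step for the fractional part, as reconstructed above. It is only illustrative; in practice a library routine such as numpy.random.Generator.gamma would normally be used.

```python
import math, random

def gamma_variate(k, theta):
    """Gamma(k, theta) variate via sum of exponentials plus rejection for the fractional part."""
    n, delta = int(k), k - int(k)
    # Integer part: -ln(U1) - ... - ln(Un) ~ Gamma(n, 1); 1 - random() keeps U in (0, 1].
    x = -sum(math.log(1.0 - random.random()) for _ in range(n))
    if delta > 0:
        v0 = math.e / (math.e + delta)
        while True:
            v1, v2, v3 = (1.0 - random.random() for _ in range(3))
            if v1 <= v0:
                xi = v2 ** (1.0 / delta)
                eta = v3 * xi ** (delta - 1.0)
            else:
                xi = 1.0 - math.log(v2)
                eta = v3 * math.exp(-xi)
            if eta <= xi ** (delta - 1.0) * math.exp(-xi):
                break                      # accept xi ~ Gamma(delta, 1)
        x += xi
    return theta * x                       # rescale to Gamma(k, theta)

print(sum(gamma_variate(2.3, 1.5) for _ in range(20000)) / 20000)  # close to k*theta = 3.45
```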

Related distributions
Special cases
If If , then X has an exponential distribution with rate parameter . , then X is identical to 2(), the chi-squared distribution with degrees of freedom.

Conversely, if and c is a positive constant, then . If is an integer, the gamma distribution is an Erlang distribution and is the probability distribution of the waiting time until the -th "arrival" in a one-dimensional Poisson process with intensity 1/. If and , then . , , and . If X has a Maxwell-Boltzmann distribution with parameter a, then . , then follows a generalized gamma distribution with parameters , then

; i.e. an exponential distribution: see skew-logistic distribution.

Conjugate prior
In Bayesian inference, the gamma distribution is the conjugate prior to many likelihood distributions: the Poisson, exponential, normal (with known mean), Pareto, gamma with known shape α, inverse gamma with known shape parameter, and Gompertz with known scale parameter. The Gamma distribution's conjugate prior is:[17]

p(α, β | p, q, r, s) = (1/Z) · p^{α−1} e^{−βq} / (Γ(α)^r β^{−αs}),

where Z is the normalizing constant, which has no closed form solution. The posterior distribution can be found by updating the parameters as follows:

p′ = p ∏ᵢ xᵢ,   q′ = q + Σᵢ xᵢ,   r′ = r + n,   s′ = s + n,


where n is the number of observations and xᵢ is the i-th observation.

Compound gamma
If the shape parameter of the gamma distribution is known, but the inverse-scale parameter is unknown, then a gamma distribution for the inverse-scale forms a conjugate prior. The compound distribution, which results from integrating out the inverse-scale has a closed form solution, known as the compound gamma distribution.[18]

Others
If X has a Γ(k, θ) distribution, then 1/X has an inverse-gamma distribution with parameters k and θ⁻¹.
If X and Y are independently distributed Γ(α, θ) and Γ(β, θ) respectively, then X/(X + Y) has a beta distribution with parameters α and β.
If Xᵢ are independently distributed Γ(αᵢ, 1) respectively, then the vector (X₁/S, ..., Xₙ/S), where S = X₁ + ... + Xₙ, follows a Dirichlet distribution with parameters α₁, ..., αₙ.
For large k the gamma distribution converges to a Gaussian distribution with mean μ = kθ and variance σ² = kθ².
The Gamma distribution is the conjugate prior for the precision of the normal distribution with known mean.
The Wishart distribution is a multivariate generalization of the gamma distribution (samples are positive-definite matrices rather than positive real numbers).
The Gamma distribution is a special case of the generalized gamma distribution, the generalized integer gamma distribution, and the generalized inverse Gaussian distribution.
Among the discrete distributions, the negative binomial distribution is sometimes considered the discrete analogue of the Gamma distribution.

Applications
The gamma distribution has been used to model the size of insurance claims and rainfalls.[19] This means that aggregate insurance claims and the amount of rainfall accumulated in a reservoir are modelled by a gamma process. The gamma distribution is also used to model errors in multi-level Poisson regression models, because the combination of the Poisson distribution and a gamma distribution is a negative binomial distribution. In neuroscience, the gamma distribution is often used to describe the distribution of inter-spike intervals.[20] Although in practice the gamma distribution often provides a good fit, there is no underlying biophysical motivation for using it. In bacterial gene expression, the copy number of a constitutively expressed protein often follows the gamma distribution, where the scale and shape parameter are, respectively, the mean number of bursts per cell cycle and the mean number of protein molecules produced by a single mRNA during its lifetime.[21] The gamma distribution is widely used as a conjugate prior in Bayesian statistics. It is the conjugate prior for the precision (i.e. inverse of the variance) of a normal distribution. It is also the conjugate prior for the exponential distribution.


Notes
[1] See Hogg and Craig (1978, Remark 3.3.1) for an explicit motivation [2] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu. edu. cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics (Elsevier): 219230. . Retrieved 2011-06-02. [3] http:/ / commons. wikimedia. org/ wiki/ File:Gamma-PDF-3D-by-k. png [4] http:/ / commons. wikimedia. org/ wiki/ File:Gamma-PDF-3D-by-Theta. png [5] http:/ / commons. wikimedia. org/ wiki/ File:Gamma-PDF-3D-by-x. png [6] Papoulis, Pillai, Probability, Random Variables, and Stochastic Processes, Fourth Edition [7] Chen J, Rubin H (1986) Bounds for the difference between median and mean of Gamma and Poisson distributions. Statist Probab Lett 4: 281283 [8] Choi KP (1994) On the medians of Gamma distributions and an equation of Ramanujan. Proc Amer Math Soc 121 (1) 245251 [9] Banneheka BMSG, Ekanayake GEMUPD (2009) A new point estimator for the median of Gamma distribution. Viyodaya J Science 14:95-103 [10] W.D. Penny, KL-Divergences of Normal, Gamma, Dirichlet, and Wishart densities [11] Luc Devroye (1986). Non-Uniform Random Variate Generation (http:/ / luc. devroye. org/ rnbookindex. html). New York: Springer-Verlag. . See Chapter 9, Section 3, pages 401428. [12] Devroye (1986), p. 406. [13] Ahrens, J. H. and Dieter, U. (1982). Generating gamma variates by a modified rejection technique. Communications of the ACM, 25, 4754. Algorithm GD, p. 53. [14] Ahrens, J. H.; Dieter, U. (1974). "Computer methods for sampling from gamma, beta, Poisson and binomial distributions". Computing 12: 223246. CiteSeerX: 10.1.1.93.3828 (http:/ / citeseerx. ist. psu. edu/ viewdoc/ summary?doi=10. 1. 1. 93. 3828). [15] Cheng, R.C.H., and Feast, G.M. Some simple gamma variate generators. Appl. Stat. 28 (1979), 290-295. [16] Marsaglia, G. The squeeze method for generating gamma variates. Comput, Math. Appl. 3 (1977), 321-325. [17] Fink, D. 1995 A Compendium of Conjugate Priors (http:/ / www. stat. columbia. edu/ ~cook/ movabletype/ mlm/ CONJINTRnew+ TEX. pdf). In progress report: Extension and enhancement of methods for setting data quality objectives. (DOE contract 95831). [18] Dubey, Satya D. (December 1970). "Compound gamma, beta and F distributions" (http:/ / www. springerlink. com/ content/ u750hg4630387205/ ). Metrika 16: 2731. doi:10.1007/BF02613934. . [19] Aksoy, H. (2000) "Use of Gamma Distribution in Hydrological Analysis" (http:/ / journals. tubitak. gov. tr/ engineering/ issues/ muh-00-24-6/ muh-24-6-7-9909-13. pdf), Turk J. Engin Environ Sci, 24, 419 428. [20] J. G. Robson and J. B. Troy, "Nature of the maintained discharge of Q, X, and Y retinal ganglion cells of the cat," J. Opt. Soc. Am. A 4, 2301-2307 (1987) [21] N. Friedman, L. Cai and X. S. Xie (2006) "Linking stochastic dynamics to population distribution: An analytical framework of gene expression," Phys. Rev. Lett. 97, 168302.

References
R. V. Hogg and A. T. Craig (1978) Introduction to Mathematical Statistics, 4th edition. New York: Macmillan. (See Section 3.3.)
S. C. Choi and R. Wette (1969) Maximum Likelihood Estimation of the Parameters of the Gamma Distribution and Their Bias, Technometrics, 11(4), 683-690
P. G. Moschopoulos (1985) The distribution of the sum of independent gamma random variables, Annals of the Institute of Statistical Mathematics, 37, 541-544
A. M. Mathai (1982) Storage capacity of a dam with gamma type inputs, Annals of the Institute of Statistical Mathematics, 34, 591-597


External links
Hazewinkel, Michiel, ed. (2001), "Gamma-distribution" (http://www.encyclopediaofmath.org/index. php?title=p/g043300), Encyclopedia of Mathematics, Springer, ISBN978-1-55608-010-4 Weisstein, Eric W., " Gamma distribution (http://mathworld.wolfram.com/GammaDistribution.html)" from MathWorld. Engineering Statistics Handbook (http://www.itl.nist.gov/div898/handbook/eda/section3/eda366b.htm)

Pareto distribution
Pareto Type I

Probability density function: Pareto Type I probability density functions for various α (labeled "k") with xm = 1. The horizontal axis is the x parameter. As α → ∞ the distribution approaches δ(x − xm), where δ is the Dirac delta function.

Cumulative distribution function: Pareto Type I cumulative distribution functions for various α (labeled "k") with xm = 1. The horizontal axis is the x parameter.

Parameters: xm > 0 scale (real), α > 0 shape (real)
Support: x ∈ [xm, ∞)
PDF: α xmᵅ / x^{α+1}
CDF: 1 − (xm/x)ᵅ
Mean: ∞ for α ≤ 1; α xm/(α − 1) for α > 1
Median: xm · 2^{1/α}
Mode: xm
Variance: ∞ for α ∈ (1, 2]; xm² α / ((α − 1)² (α − 2)) for α > 2
Skewness: (2(1 + α)/(α − 3)) √((α − 2)/α) for α > 3
Ex. kurtosis: 6(α³ + α² − 6α − 2) / (α(α − 3)(α − 4)) for α > 4
Entropy: ln(xm/α) + 1/α + 1
MGF: α(−xm t)ᵅ Γ(−α, −xm t) for t < 0
CF: α(−i xm t)ᵅ Γ(−α, −i xm t)

The Pareto distribution, named after the Italian economist Vilfredo Pareto, is a power law probability distribution that is found in descriptions of social, scientific, geophysical, actuarial, and many other types of observable phenomena. Outside the field of economics it is sometimes referred to as the Bradford distribution.

Definition
If X is a random variable with a Pareto (Type I) distribution,[1] then the probability that X is greater than some number x, i.e. the survival function (also called tail function), is given by

F̄(x) = Pr(X > x) = (xm/x)ᵅ for x ≥ xm, and 1 for x < xm,

where xm is the (necessarily positive) minimum possible value of X, and α is a positive parameter. The Pareto Type I distribution is characterized by a scale parameter xm and a shape parameter α, which is known as the tail index. When this distribution is used to model the distribution of wealth, then the parameter α is called the Pareto index.

Properties
Cumulative distribution function
From the definition, the cumulative distribution function of a Pareto random variable with parameters α and xm is

F_X(x) = 1 − (xm/x)ᵅ for x ≥ xm, and 0 for x < xm.

When plotted on linear axes, the distribution assumes the familiar J-shaped curve which approaches each of the orthogonal axes asymptotically. All segments of the curve are self-similar (subject to appropriate scaling factors). When plotted in a log-log plot, the distribution is represented by a straight line.


Probability density function


It follows (by differentiation) that the probability density function is

f_X(x) = α xmᵅ / x^{α+1} for x ≥ xm, and 0 for x < xm.

Moments and characteristic function


The expected value of a random variable following a Pareto distribution is

E[X] = ∞ for α ≤ 1, and E[X] = α xm/(α − 1) for α > 1.

The variance of a random variable following a Pareto distribution is

Var(X) = (xm/(α − 1))² · α/(α − 2) for α > 2.

(If α ≤ 2, the variance does not exist.)

The raw moments are

μₙ′ = ∞ for α ≤ n, and μₙ′ = α xmⁿ/(α − n) for α > n.

The moment generating function is only defined for non-positive values t ≤ 0 as

M(t; α, xm) = α (−xm t)ᵅ Γ(−α, −xm t) for t < 0, with M(0; α, xm) = 1.

The characteristic function is given by

φ(t; α, xm) = α (−i xm t)ᵅ Γ(−α, −i xm t),

where Γ(a, x) is the incomplete gamma function.

Degenerate case
The Dirac delta function is a limiting case of the Pareto density:

lim_{α→∞} α xmᵅ / x^{α+1} = δ(x − xm).

Conditional distributions
The conditional probability distribution of a Pareto-distributed random variable, given the event that it is greater than or equal to a particular number x₁ exceeding xm, is a Pareto distribution with the same Pareto index α but with minimum x₁ instead of xm.


A characterization theorem
Suppose Xi, i = 1, 2, 3, ... are independent identically distributed random variables whose probability distribution is supported on the interval [xm,) for some xm>0. Suppose that for all n, the two random variables min{X1,...,Xn} and (X1+...+Xn)/min{X1,...,Xn} are independent. Then the common distribution is a Pareto distribution.

Geometric mean
The geometric mean (G) is[2]

G = xm exp(1/α).

Harmonic mean
The harmonic mean (H) is[2]

H = xm (1 + 1/α).

Generalized Pareto distributions


There is a hierarchy[1][3] of Pareto distributions known as Pareto Type I, II, III, IV, and Feller-Pareto distributions.[1][3][4] Pareto Type IV contains Pareto Type I and II as special cases. The Feller-Pareto[3][5] distribution generalizes Pareto Type IV.

Pareto Types I-IV

The Pareto distribution hierarchy is summarized in the table comparing the survival functions (complementary CDF). The Pareto distribution of the second kind is also known as the Lomax distribution.[6]

Pareto Distributions (survival functions, support, parameters)
Type I:          F̄(x) = (x/xm)^{−α},                     x ≥ xm,  xm > 0, α > 0
Type II (Lomax): F̄(x) = (1 + (x − μ)/σ)^{−α},            x ≥ μ,   μ real, σ > 0, α > 0
Type III:        F̄(x) = (1 + ((x − μ)/σ)^{1/γ})^{−1},     x ≥ μ,   μ real, σ > 0, γ > 0
Type IV:         F̄(x) = (1 + ((x − μ)/σ)^{1/γ})^{−α},     x ≥ μ,   μ real, σ > 0, γ > 0, α > 0

The shape parameter α is the tail index, μ is location, σ is scale, and γ is an inequality parameter. Some special cases of Pareto Type IV are:

P(IV)(σ, σ, 1, α) = P(I)(σ, α),
P(IV)(μ, σ, 1, α) = P(II)(μ, σ, α),
P(IV)(μ, σ, γ, 1) = P(III)(μ, σ, γ).

The finiteness of the mean, and the existence and the finiteness of the variance, depend on the tail index α (inequality index γ). In particular, fractional δ-moments are finite for some δ > 0, as shown in the table below, where δ is not necessarily an integer.


Moments of Pareto I-IV Distributions (case μ = 0)

[Table: for each of Types I-IV, the mean E[X] with the condition for its existence, and the fractional moment E[X^δ] with the condition on δ under which it is finite.]

FellerPareto distribution
Feller[3][5] defines a Pareto variable by the transformation U = Y⁻¹ − 1 of a beta random variable Y, whose probability density function is

f(y) = y^{γ₁−1} (1 − y)^{γ₂−1} / B(γ₁, γ₂), 0 < y < 1, γ₁ > 0, γ₂ > 0,

where B() is the beta function. If

W = μ + σ (Y⁻¹ − 1)^γ, σ > 0, γ > 0,

then W has a Feller-Pareto distribution FP(μ, σ, γ, γ₁, γ₂).[1] If U₁ ~ Γ(δ₁, 1) and U₂ ~ Γ(δ₂, 1) are independent Gamma variables, another construction of a Feller-Pareto (FP) variable is[7]

W = μ + σ (U₁/U₂)^γ,

and we write W ~ FP(μ, σ, γ, δ₁, δ₂). Special cases of the Feller-Pareto distribution are the Pareto Types I-IV above.

Applications
Pareto originally used this distribution to describe the allocation of wealth among individuals since it seemed to show rather well the way that a larger portion of the wealth of any society is owned by a smaller percentage of the people in that society. He also used it to describe distribution of income.[8] This idea is sometimes expressed more simply as the Pareto principle or the "80-20 rule" which says that 20% of the population controls 80% of the wealth.[9] However, the 80-20 rule corresponds to a particular value of , and in fact, Pareto's data on British income taxes in his Cours d'conomie politique indicates that about 30% of the population had about 70% of the income. The probability density function (PDF) graph at the beginning of this article shows that the "probability" or fraction of the population that owns a small amount of wealth per person is rather high, and then decreases steadily as wealth increases. (Note that the Pareto distribution is not realistic for wealth for the lower end. In fact, net worth may even be negative.) This distribution is not limited to describing wealth or income, but to many situations in which an equilibrium is found in the distribution of the "small" to the "large". The following examples are sometimes seen as approximately Pareto-distributed: The sizes of human settlements (few cities, many hamlets/villages)[10] File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones)[10]

Hard disk drive error rates[11]
Clusters of Bose-Einstein condensate near absolute zero
The values of oil reserves in oil fields (a few large fields, many small fields)[10]
The length distribution in jobs assigned to supercomputers (a few large ones, many small ones)
The standardized price returns on individual stocks[10]


Sizes of sand particles[10]
Sizes of meteorites
Numbers of species per genus (there is subjectivity involved: the tendency to divide a genus into two or more increases with the number of species in it)
Areas burnt in forest fires
Severity of large casualty losses for certain lines of business such as general liability, commercial auto, and workers compensation[12][13]

In hydrology the Pareto distribution is applied to extreme events such as annually maximum one-day rainfalls and river discharges. The blue picture "Fitted cumulative Pareto distribution to maximum one-day rainfalls" illustrates an example of fitting the Pareto distribution to ranked annually maximum one-day rainfalls, showing also the 90% confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part of the cumulative frequency analysis.

Relation to other distributions


Relation to the exponential distribution
The Pareto distribution is related to the exponential distribution as follows. If X is Pareto-distributed with minimum xm and index α, then

Y = ln(X/xm)

is exponentially distributed with intensity (rate parameter) α. Equivalently, if Y is exponentially distributed with intensity α, then

xm e^Y

is Pareto-distributed with minimum xm and index α. This can be shown using the standard change of variable techniques:

Pr(Y < y) = Pr(ln(X/xm) < y) = Pr(X < xm e^y) = 1 − (xm/(xm e^y))ᵅ = 1 − e^{−αy}.

The last expression is the cumulative distribution function of an exponential distribution with intensity α.

Relation to the log-normal distribution


Note that the Pareto distribution and log-normal distribution are alternative distributions for describing the same types of quantities. One of the connections between the two is that they are both the distributions of the exponential of random variables distributed according to other common distributions, respectively the exponential distribution and normal distribution. (Both of these latter two distributions are "basic" in the sense that the logarithms of their density functions are linear and quadratic, respectively, functions of the observed values.)


Relation to the generalized Pareto distribution


The Pareto distribution is a special case of the generalized Pareto distribution, which is a family of distributions of similar form, but containing an extra parameter in such a way that the support of the distribution is either bounded below (at a variable point), or bounded both above and below (where both are variable), with the Lomax distribution as a special case. This family also contains both the unshifted and shifted exponential distributions.

Relation to Zipf's law


Pareto distributions are continuous probability distributions. Zipf's law, also sometimes called the zeta distribution, may be thought of as a discrete counterpart of the Pareto distribution.

Relation to the "Pareto principle"


The "80-20 law", according to which 20 % of all people receive 80 % of all income, and 20 % of the most affluent 20 % receive 80 % of that 80 %, and so on, holds precisely when the Pareto index is=log4(5)=log(5)/log(4), approximately 1.161. This result can be derived from the Lorenz curve formula given below. Moreover, the following have been shown[14] to be mathematically equivalent: Income is distributed according to a Pareto distribution with index >1. There is some number 0p1/2 such that 100p % of all people receive 100(1p) % of all income, and similarly for every real (not necessarily integer) n>0, 100pn % of all people receive 100(1p)n % of all income. This does not apply only to income, but also to wealth, or to anything else that can be modeled by this distribution. This excludes Pareto distributions in which0<1, which, as noted above, have infinite expected value, and so cannot reasonably model income distribution.

Lorenz curve and Gini coefficient


The Lorenz curve is often used to characterize income and wealth distributions. For any distribution, the Lorenz curve L(F) is written in terms of the PDF f or the CDF F as

L(F) = ∫_{xm}^{x(F)} x f(x) dx / ∫_{xm}^{∞} x f(x) dx,

where x(F) is the inverse of the CDF. For the Pareto distribution,

x(F) = xm / (1 − F)^{1/α},

and the Lorenz curve is calculated to be

L(F) = 1 − (1 − F)^{1 − 1/α},

where α must be greater than or equal to unity, since the denominator in the expression for L(F) is just the mean value of x. Examples of the Lorenz curve for a number of Pareto distributions are shown in the graph on the right; the case α = ∞ corresponds to a perfectly equal distribution (G = 0) and the α = 1 line corresponds to complete inequality (G = 1). The Gini coefficient is a measure of the deviation of the Lorenz curve from the equidistribution line, which is a line connecting [0, 0] and [1, 1] and is shown in black (α = ∞) in the Lorenz plot on the right. Specifically, the Gini coefficient is twice the area between the Lorenz curve and the equidistribution line. The Gini coefficient for the Pareto distribution is then calculated to be

G = 1 − 2 ∫₀¹ L(F) dF = 1 / (2α − 1)

(see Aaberge 2005).

Parameter estimation
The likelihood function for the Pareto distribution parameters α and xm, given a sample x = (x₁, x₂, ..., xₙ), is

L(α, xm) = ∏_{i=1}^{n} α xmᵅ / xᵢ^{α+1} = αⁿ xm^{nα} ∏_{i=1}^{n} 1/xᵢ^{α+1}.

Therefore, the logarithmic likelihood function is

ℓ(α, xm) = n ln α + nα ln xm − (α + 1) Σᵢ ln xᵢ.

It can be seen that ℓ(α, xm) is monotonically increasing with xm, that is, the greater the value of xm, the greater the value of the likelihood function. Hence, since x ≥ xm, we conclude that

x̂m = minᵢ xᵢ.

To find the estimator for α, we compute the corresponding partial derivative and determine where it is zero:

∂ℓ/∂α = n/α + n ln xm − Σᵢ ln xᵢ = 0.

Thus the maximum likelihood estimator for α is:

α̂ = n / Σᵢ ln(xᵢ/x̂m).

The expected statistical error is:

σ = α̂ / √n.


[15]
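A short Python sketch of these estimators (assuming NumPy; the true parameter values used to generate the test sample are arbitrary):

```python
import numpy as np

def pareto_mle(x):
    """MLE of (x_m, alpha) for i.i.d. Pareto Type I data."""
    x = np.asarray(x, dtype=float)
    xm_hat = x.min()                                  # MLE of the scale parameter
    alpha_hat = len(x) / np.log(x / xm_hat).sum()     # MLE of the tail index
    return xm_hat, alpha_hat

rng = np.random.default_rng(2)
# numpy's pareto() draws Lomax variates; (1 + Lomax) * xm gives Pareto Type I.
sample = (rng.pareto(2.5, size=10_000) + 1) * 3.0
print(pareto_mle(sample))                             # close to (3.0, 2.5)
```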

Graphical representation
The characteristic curved "long tail" distribution, when plotted on a linear scale, masks the underlying simplicity of the function when plotted on a log-log graph, which then takes the form of a straight line with negative gradient: it follows from the formula for the probability density function that for x ≥ xm,

log f(x) = log(α xmᵅ / x^{α+1}) = log(α xmᵅ) − (α + 1) log x.

Since α is positive, the gradient −(α + 1) is negative.


Random sample generation


Random samples can be generated using inverse transform sampling. Given a random variate U drawn from the uniform distribution on the unit interval (0, 1], the variate T given by

T = xm / U^{1/α}

is Pareto-distributed.[16] If U is uniformly distributed on [0, 1), it can be exchanged for (1 − U).
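A minimal Python sketch of this inverse transform (assuming NumPy; parameter values are illustrative), with an empirical check against the survival function:

```python
import numpy as np

rng = np.random.default_rng(3)
xm, alpha, n = 1.0, 1.5, 100_000

u = 1.0 - rng.uniform(size=n)      # map [0, 1) onto (0, 1] so U ** (-1/alpha) is finite
t = xm / u ** (1.0 / alpha)        # T = x_m / U^(1/alpha) is Pareto(x_m, alpha)

# Empirical survival probabilities versus the theoretical (xm/x)^alpha
for x in (2.0, 5.0, 10.0):
    print(x, (t > x).mean(), (xm / x) ** alpha)
```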

Variants
Bounded Pareto distribution
Bounded Pareto

Parameters: L > 0 location (real), H > L location (real), α > 0 shape (real)
Support: L ≤ x ≤ H
PDF: α Lᵅ x^{−α−1} / (1 − (L/H)ᵅ)
CDF: (1 − Lᵅ x^{−α}) / (1 − (L/H)ᵅ)
Mean: Lᵅ/(1 − (L/H)ᵅ) · (α/(α − 1)) · (1/L^{α−1} − 1/H^{α−1}), α ≠ 1
Median: L (1 − (1/2)(1 − (L/H)ᵅ))^{−1/α}
Variance (second moment): Lᵅ/(1 − (L/H)ᵅ) · (α/(α − 2)) · (1/L^{α−2} − 1/H^{α−2}), α ≠ 2

The bounded (or truncated) Pareto distribution has three parameters α, L and H. As in the standard Pareto distribution, α determines the shape. L denotes the minimal value, and H denotes the maximal value. (The variance in the table on the right should be interpreted as the 2nd moment.) The probability density function is

f(x) = α Lᵅ x^{−α−1} / (1 − (L/H)ᵅ),

where L ≤ x ≤ H, and α > 0.

Generating bounded Pareto random variables

If U is uniformly distributed on (0, 1), then applying the inverse transform method [17],

x = (−(U Hᵅ − U Lᵅ − Hᵅ) / (Hᵅ Lᵅ))^{−1/α}

is bounded Pareto-distributed.

Symmetric Pareto distribution


The symmetric Pareto distribution can be defined by a probability density function that has a similar shape to a Pareto distribution for x ≥ xm and is mirror symmetric about the vertical axis.[18]

Notes
[1] [2] [3] [4] [5] [6] Barry C. Arnold (1983). Pareto Distributions. International Co-operative Publishing House. ISBN0-89974-012-X. Johnson NL, Kotz S, Balakrishnan N (1994) Continuous univariate distributions Vol 1. Wiley Series in Probability and Statistics. Johnson, Kotz, and Balakrishnan (1994), (20.4). Christian Kleiber and Samuel Kotz (2003). Statistical Size Distributions in Economics and Actuarial Sciences. Wiley. ISBN0-471-15064-9. Feller, W. (1971). An Introduction to Probability Theory and its Applications, 2 (Second edition), New York: Wiley. Lomax, K. S. (1954). Business failures. Another example of the analysis of failure data.Journal of the American Statistical Association, 49, 847852. [7] Chotikapanich, Duangkamon. "Chapter 7: Pareto and Generalized Pareto Distributions" (http:/ / books. google. com/ books?id=fUJZZLj1kbwC). Modeling Income Distributions and Lorenz Curves. pp.121122. . [8] Pareto, Vilfredo, Cours dconomie Politique: Nouvelle dition par G.-H. Bousquet et G. Busino, Librairie Droz, Geneva, 1964, pages 299345. [9] For a two-quantile population, where approximately 18% of the population owns 82% of the wealth, the Theil index takes the value 1. [10] Reed, William J.; et al. (2004). "The Double Pareto-Lognormal Distribution A New Parametric Model for Size Distributions". Communications in Statistics : Theory and Methods 33 (8): 17331753. CiteSeerX: 10.1.1.70.4555 (http:/ / citeseerx. ist. psu. edu/ viewdoc/ summary?doi=10. 1. 1. 70. 4555). [11] Schroeder, Bianca; Damouras, Sotirios; Gill, Phillipa (2010-02-24). "Understanding latent sector error and how to protect against them" (http:/ / www. usenix. org/ event/ fast10/ tech/ full_papers/ schroeder. pdf). 8th Usenix Conference on File and Storage Technologies (FAST 2010). . Retrieved 2010-09-10. "We experimented with 5 different distributions (Geometric,Weibull, Rayleigh, Pareto, and Lognormal), that are commonly used in the context of system reliability, and evaluated their t through the total squared differences between the actual and hypothesized frequencies (2 statistic). We found consistently across all models that the geometric distribution is a poor t, while the Pareto distribution provides the best t." [12] Kleiber and Kotz (2003): page 94. [13] Seal, H. (1980). "Survival probabilities based on Pareto claim distributions". ASTIN Bulletin 11: 6171. [14] Hardy, Michael (2010). "Pareto's Law". Mathematical Intelligencer 32 (3): 3843. doi:10.1007/s00283-010-9159-2. [15] M. E. J. Newman (2005). "Power laws, Pareto distributions and Zipf's law". Contemporary Physics 46 (5): 323351. arXiv:cond-mat/0412004. Bibcode2005ConPh..46..323N. doi:10.1080/00107510500052444. [16] Tanizaki, Hisashi (2004). Computational Methods in Statistics and Econometrics (http:/ / books. google. com/ books?id=pOGAUcn13fMC& printsec=frontcover). CRC Press. p.133. . [17] http:/ / www. cs. bgu. ac. il/ ~mps042/ invtransnote. htm [18] Grabchak, M. & Samorodnitsky, D.. "Do Financial Returns Have Finite or Infinite Variance? A Paradox and an Explanation" (http:/ / people. orie. cornell. edu/ ~gennady/ techreports/ RetTailParadoxExplFinal. pdf). pp.78. .



External links
The Pareto, Zipf and other power laws / William J. Reed PDF (http://linkage.rockefeller.edu/wli/zipf/reed01_el.pdf)
Gini's Nuclear Family / Rolf Aaberg. In: International Conference to Honor Two Eminent Social Scientists (http://www.unisi.it/eventi/GiniLorenz05/), May 2005. PDF (http://www.unisi.it/eventi/GiniLorenz05/25 may paper/PAPER_Aaberge.pdf)
syntraf1.c (http://www.csee.usf.edu/~christen/tools/syntraf1.c) is a C program to generate synthetic packet traffic with bounded Pareto burst size and exponential interburst time.
"Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes" / Mark E. Crovella and Azer Bestavros (http://www.cs.bu.edu/~crovella/paper-archive/self-sim/journal-version.pdf)
Weisstein, Eric W., "Pareto distribution" (http://mathworld.wolfram.com/ParetoDistribution.html) from MathWorld.

Inverse-gamma distribution


[Plots: probability density function and cumulative distribution function.]

Parameters: α > 0 shape (real), β > 0 scale (real)
Support: x ∈ (0, ∞)
PDF and CDF: see text below
Mean: β/(α − 1), for α > 1
Mode: β/(α + 1)
Variance: β²/((α − 1)²(α − 2)), for α > 2
Skewness: 4√(α − 2)/(α − 3), for α > 3
Ex. kurtosis: (30α − 66)/((α − 3)(α − 4)), for α > 4

In probability theory and statistics, the inverse gamma distribution is a two-parameter family of continuous probability distributions on the positive real line, which is the distribution of the reciprocal of a variable distributed according to the gamma distribution. Perhaps the chief use of the inverse gamma distribution is in Bayesian statistics, where it serves as the conjugate prior of the variance of a normal distribution. However, it is common among Bayesians to consider an alternative parametrization of the normal distribution in terms of the precision,

defined as the reciprocal of the variance, which allows the gamma distribution to be used directly as a conjugate prior.

Characterization
Probability density function
The inverse gamma distribution's probability density function is defined over the support x > 0:

f(x; α, β) = (β^α / Γ(α)) x^(−α−1) exp(−β / x),

with shape parameter α and scale parameter β.

Cumulative distribution function


The cumulative distribution function is the regularized gamma function

F(x; α, β) = Γ(α, β/x) / Γ(α) = Q(α, β/x),

where the numerator is the upper incomplete gamma function and the denominator is the gamma function. Many math packages allow you to compute Q, the regularized gamma function, directly.
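As a sketch of that last remark, the CDF can be evaluated with any routine for the regularized upper incomplete gamma function Q; here SciPy's gammaincc is used, and the helper name and example values are illustrative.

from scipy.special import gammaincc  # gammaincc(a, x) = Q(a, x) = Gamma(a, x) / Gamma(a)

def invgamma_cdf(x, alpha, beta):
    """CDF of the inverse-gamma distribution: F(x; alpha, beta) = Q(alpha, beta / x) for x > 0."""
    return gammaincc(alpha, beta / x) if x > 0 else 0.0

print(invgamma_cdf(2.0, alpha=3.0, beta=1.0))  # probability that X <= 2 for X ~ Inv-Gamma(3, 1)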

Properties
For α > 0 and β > 0,

E[ln X] = ln β − ψ(α),

where ψ is the digamma function.

Related distributions
If X ~ Inv-Gamma(α, β) and k > 0, then kX ~ Inv-Gamma(α, kβ).
If X ~ Inv-Gamma(ν/2, 1/2), then X has an inverse-chi-squared distribution with ν degrees of freedom.
If X ~ Inv-Gamma(ν/2, νσ²/2), then X has a scaled-inverse-chi-squared distribution, Scale-inv-χ²(ν, σ²).
If X ~ Inv-Gamma(1/2, c/2), then X has a Lévy distribution with scale c.
If X ~ Gamma(α, β) (Gamma distribution with rate parameter β), then 1/X ~ Inv-Gamma(α, β).
The inverse gamma distribution is a special case of type 5 Pearson distribution.
A multivariate generalization of the inverse-gamma distribution is the inverse-Wishart distribution.
For the distribution of a sum of independent inverted Gamma variables see Witkovsky (2001).


Derivation from Gamma distribution


The pdf of the gamma distribution is

f(x) = x^(k−1) e^(−x/θ) / (Γ(k) θ^k),

and define the transformation y = g(x) = 1/x; then the resulting transformed density is

f_Y(y) = f_X(g^(−1)(y)) |d g^(−1)(y) / dy|
       = (1 / (Γ(k) θ^k)) (1/y)^(k−1) exp(−1/(θy)) (1/y²)
       = (1 / (Γ(k) θ^k)) y^(−k−1) exp(−1/(θy)).

Replacing k with α, and θ with 1/β, results in the inverse-gamma pdf shown above.
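A short numerical check of this derivation, under the assumption that reciprocals of Gamma(α, rate β) draws should match SciPy's invgamma with scale β; the seed and sample size are arbitrary choices.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0

# Draw Gamma(shape=alpha, rate=beta) variates (NumPy's scale parameter is 1/rate).
gamma_samples = rng.gamma(shape=alpha, scale=1.0 / beta, size=100_000)
inv_samples = 1.0 / gamma_samples

# Kolmogorov-Smirnov comparison against Inv-Gamma(alpha, beta).
result = stats.kstest(inv_samples, stats.invgamma(a=alpha, scale=beta).cdf)
print(result.statistic)  # small value => samples consistent with the inverse-gamma pdf above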

References
V. Witkovsky (2001) Computing the distribution of a linear combination of inverted gamma variables, Kybernetika 37(1), 79-90

Chi-squared distribution


[Plots: probability density function and cumulative distribution function.]

Notation: χ²(k) or χ²_k
Parameters: k ∈ N (known as "degrees of freedom")
Support: x ∈ [0, +∞)
PDF, CDF, entropy: see text below
Mean: k
Median: ≈ k (1 − 2/(9k))³
Mode: max{k − 2, 0}
Variance: 2k
Skewness: √(8/k)
Ex. kurtosis: 12/k
MGF: (1 − 2t)^(−k/2), for t < 1/2
CF: (1 − 2it)^(−k/2) [1]

In probability theory and statistics, the chi-squared distribution (also chi-square or χ²-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. It is one of the most widely used probability distributions in inferential statistics, e.g., in hypothesis testing or in construction of confidence intervals.[2][3][4][5] When there is a need to contrast it with the noncentral chi-squared distribution, this distribution is sometimes called the central chi-squared distribution.

The chi-squared distribution is used in the common chi-squared tests for goodness of fit of an observed distribution to a theoretical one, the independence of two criteria of classification of qualitative data, and in confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation. Many other statistical tests also use this distribution, such as Friedman's analysis of variance by ranks. The chi-squared distribution is a special case of the gamma distribution.

Definition
If Z1, ..., Zk are independent, standard normal random variables, then the sum of their squares,

Q = Z1² + Z2² + ... + Zk²,

is distributed according to the chi-squared distribution with k degrees of freedom. This is usually denoted as

Q ~ χ²(k)  or  Q ~ χ²_k.

The chi-squared distribution has one parameter: k, a positive integer that specifies the number of degrees of freedom (i.e. the number of Zi's).
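A minimal simulation of this definition (the sample size, seed, and choice of k are arbitrary): summing k squared standard normals reproduces the χ²(k) mean and variance.

import numpy as np

rng = np.random.default_rng(1)
k, n = 5, 200_000

z = rng.standard_normal(size=(n, k))   # n draws of k independent standard normals
q = (z ** 2).sum(axis=1)               # Q = Z1^2 + ... + Zk^2 ~ chi-squared(k)

print(q.mean(), q.var())               # approximately k and 2k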

Characteristics
Further properties of the chi-squared distribution can be found in the summary box at the beginning of this article.

Probability density function


The probability density function (pdf) of the chi-squared distribution is

f(x; k) = x^(k/2 − 1) e^(−x/2) / (2^(k/2) Γ(k/2))   for x ≥ 0 (and 0 otherwise),

where Γ(k/2) denotes the Gamma function, which has closed-form values for odd k. For derivations of the pdf in the cases of one and two degrees of freedom, see Proofs related to chi-squared distribution.

Cumulative distribution function


Its cumulative distribution function is:

F(x; k) = γ(k/2, x/2) / Γ(k/2) = P(k/2, x/2),

where γ(k, z) is the lower incomplete Gamma function and P(k, z) is the regularized Gamma function.

In a special case of k = 2 this function has a simple form:

F(x; 2) = 1 − e^(−x/2).

For the cases when 0 < z < 1 (which include all of the cases when this CDF is less than half), the following Chernoff upper bound may be obtained:[6]

F(zk; k) ≤ (z e^(1−z))^(k/2).

The tail bound for the cases when z > 1 follows similarly:

1 − F(zk; k) ≤ (z e^(1−z))^(k/2).

Tables of this cumulative distribution function are widely available and the function is included in many spreadsheets and all statistical packages. For another approximation for the CDF modeled after the cube of a Gaussian, see under Noncentral chi-squared distribution.
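A sketch of evaluating the CDF as the regularized lower incomplete gamma function P(k/2, x/2), cross-checked against scipy.stats.chi2; the function name and example values are illustrative.

from scipy.special import gammainc  # gammainc(a, x) = P(a, x), regularized lower incomplete gamma
from scipy import stats

def chi2_cdf(x, k):
    """F(x; k) = P(k/2, x/2)."""
    return gammainc(k / 2.0, x / 2.0)

print(chi2_cdf(3.84, 1), stats.chi2.cdf(3.84, df=1))  # both close to 0.95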


Additivity
It follows from the definition of the chi-squared distribution that the sum of independent chi-squared variables is also chi-squared distributed. Specifically, if X1, ..., Xn are independent chi-squared variables with k1, ..., kn degrees of freedom, respectively, then Y = X1 + ... + Xn is chi-squared distributed with k1 + ... + kn degrees of freedom.

Information entropy
The information entropy is given by

H = k/2 + ln(2 Γ(k/2)) + (1 − k/2) ψ(k/2),

where ψ(x) is the Digamma function. The chi-squared distribution is the maximum entropy probability distribution for a random variate X for which E[X] = k and E[ln X] = ψ(k/2) + ln 2 are fixed.[7]

Noncentral moments
The moments about zero of a chi-squared distribution with k degrees of freedom are given by[8][9]

E[X^m] = k (k + 2) (k + 4) ⋯ (k + 2m − 2) = 2^m Γ(m + k/2) / Γ(k/2).

Cumulants
The cumulants are readily obtained by a (formal) power series expansion of the logarithm of the characteristic function:

κ_n = 2^(n−1) (n − 1)! k.

Asymptotic properties
By the central limit theorem, because the chi-squared distribution is the sum of k independent random variables with finite mean and variance, it converges to a normal distribution for large k. For many practical purposes, for k > 50 the distribution is sufficiently close to a normal distribution for the difference to be ignored.[10] Specifically, if X ~ χ²(k), then as k tends to infinity, the distribution of (X − k)/√(2k) tends to a standard normal distribution. However, convergence is slow, as the skewness is √(8/k) and the excess kurtosis is 12/k.

Other functions of the chi-squared distribution converge more rapidly to a normal distribution. Some examples are:

If X ~ χ²(k), then √(2X) is approximately normally distributed with mean √(2k − 1) and unit variance (result credited to R. A. Fisher).[11]
If X ~ χ²(k), then (X/k)^(1/3) is approximately normally distributed with mean 1 − 2/(9k) and variance 2/(9k). This is known as the Wilson-Hilferty transformation.
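A small numerical illustration of the Wilson-Hilferty transformation (the chosen k and x values are arbitrary and merely echo table entries given later): the cube-root approximation is compared with the exact CDF.

import numpy as np
from scipy import stats

def wilson_hilferty_cdf(x, k):
    """Approximate P(X <= x) for X ~ chi-squared(k) via the Wilson-Hilferty cube-root transform."""
    z = ((x / k) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * k))) / np.sqrt(2.0 / (9.0 * k))
    return stats.norm.cdf(z)

for k, x in [(5, 11.07), (10, 18.31)]:
    print(k, x, wilson_hilferty_cdf(x, k), stats.chi2.cdf(x, df=k))  # approximation vs exact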


Relation to other distributions


As k → ∞, (χ²_k − k)/√(2k) converges in distribution to the standard normal distribution; for large k, X ~ χ²(k) is approximately N(k, 2k) (normal distribution).
As a special case, χ²_k is the noncentral chi-squared distribution with non-centrality parameter λ = 0.
If Z is a standard normal random variable, then Z² has the chi-squared distribution with one degree of freedom; the squared norm of k independent standard normally distributed variables has the chi-squared distribution with k degrees of freedom.
If X ~ χ²(k), then √X has the chi distribution; for k = 2 this is the Rayleigh distribution and for k = 3 the Maxwell distribution.
If X ~ χ²(k), then 1/X has the inverse-chi-squared distribution.
If X ~ χ²(k), then X ~ Gamma(k/2, 1/2) (gamma distribution with shape k/2 and rate 1/2).
The chi-squared distribution is a special case of type 3 Pearson distribution.
If X ~ χ²(k1) and Y ~ χ²(k2) are independent, then X/(X + Y) ~ Beta(k1/2, k2/2) (beta distribution).
If U has a uniform distribution on (0, 1), then −2 ln U ~ χ²(2).
The chi-squared distribution is a transformation of the Laplace distribution: if X ~ Laplace(μ, b), then 2|X − μ|/b ~ χ²(2).
The chi-squared distribution is a transformation of the Pareto distribution.
Student's t-distribution is a transformation of the chi-squared distribution.
Student's t-distribution can be obtained from the chi-squared distribution and the normal distribution.
The noncentral beta distribution can be obtained as a transformation of the chi-squared distribution and the noncentral chi-squared distribution.
The noncentral t-distribution can be obtained from the normal distribution and the chi-squared distribution.
A chi-squared variable with k degrees of freedom is defined as the sum of the squares of k independent standard normal random variables. If Y is a k-dimensional Gaussian random vector with mean vector μ and rank-k covariance matrix C, then X = (Y − μ)ᵀ C⁻¹ (Y − μ) is chi-squared distributed with k degrees of freedom.
The sum of squares of statistically independent unit-variance Gaussian variables which do not have mean zero yields a generalization of the chi-squared distribution called the noncentral chi-squared distribution.
If Y is a vector of k i.i.d. standard normal random variables and A is a k×k idempotent matrix with rank k − n, then the quadratic form YᵀAY is chi-squared distributed with k − n degrees of freedom.
The chi-squared distribution is also naturally related to other distributions arising from the Gaussian. In particular, Y = (X1/k1)/(X2/k2) is F-distributed, Y ~ F(k1, k2), where X1 ~ χ²(k1) and X2 ~ χ²(k2) are statistically independent.
If X is chi-squared distributed, then √X is chi distributed.
If X1 ~ χ²(k1) and X2 ~ χ²(k2) are statistically independent, then X1 + X2 ~ χ²(k1 + k2). If X1 and X2 are not independent, then X1 + X2 is not chi-squared distributed.


Generalizations
The chi-squared distribution is obtained as the sum of the squares of k independent, zero-mean, unit-variance Gaussian random variables. Generalizations of this distribution can be obtained by summing the squares of other types of Gaussian random variables. Several such distributions are described below.

Chi-squared distributions
Noncentral chi-squared distribution
The noncentral chi-squared distribution is obtained from the sum of the squares of independent Gaussian random variables having unit variance and nonzero means.

Generalized chi-squared distribution
The generalized chi-squared distribution is obtained from the quadratic form z′Az, where z is a zero-mean Gaussian vector having an arbitrary covariance matrix, and A is an arbitrary matrix.

Gamma, exponential, and related distributions


The chi-squared distribution X ~ χ²(k) is a special case of the gamma distribution, in that X ~ Γ(k/2, 1/2) (using the shape-rate parameterization of the gamma distribution) where k is an integer. Because the exponential distribution is also a special case of the Gamma distribution, we also have that if X ~ χ²(2), then X ~ Exp(1/2) is an exponential distribution. The Erlang distribution is also a special case of the Gamma distribution and thus we also have that if X ~ χ²(k) with even k, then X is Erlang distributed with shape parameter k/2 and rate parameter 1/2.

Applications
The chi-squared distribution has numerous applications in inferential statistics, for instance in chi-squared tests and in estimating variances. It enters the problem of estimating the mean of a normally distributed population and the problem of estimating the slope of a regression line via its role in Student's t-distribution. It enters all analysis of variance problems via its role in the F-distribution, which is the distribution of the ratio of two independent chi-squared random variables, each divided by their respective degrees of freedom.

Following are some of the most common situations in which the chi-squared distribution arises from a Gaussian-distributed sample: if X1, ..., Xn are i.i.d. N(μ, σ²) random variables, then Σ_{i=1}^n (Xi − X̄)² ~ σ² χ²(n−1), where X̄ = (1/n) Σ_{i=1}^n Xi.

The box below shows some statistics based on Xi ~ Normal(μi, σ²i), i = 1, ..., k, independent random variables, that have probability distributions with names starting with "chi":


Name: Statistic
chi-squared distribution: Σ_{i=1}^k ((Xi − μi)/σi)²
noncentral chi-squared distribution: Σ_{i=1}^k (Xi/σi)²
chi distribution: √( Σ_{i=1}^k ((Xi − μi)/σi)² )
noncentral chi distribution: √( Σ_{i=1}^k (Xi/σi)² )

Table of χ² value vs p-value

The p-value is the probability of observing a test statistic at least as extreme in a chi-squared distribution. Accordingly, since the cumulative distribution function (CDF) for the appropriate degrees of freedom (df) gives the probability of having obtained a value less extreme than this point, subtracting the CDF value from 1 gives the p-value. The table below gives a number of p-values matching to χ² for the first 10 degrees of freedom. A p-value of 0.05 or less is usually regarded as statistically significant, i.e. the observed deviation from the null hypothesis is significant. A short code sketch reproducing these values appears after the table.
Degrees of freedom (df)        χ² value[12]
 1     0.004   0.02    0.06    0.15    0.46    1.07    1.64    2.71    3.84    6.64    10.83
 2     0.10    0.21    0.45    0.71    1.39    2.41    3.22    4.60    5.99    9.21    13.82
 3     0.35    0.58    1.01    1.42    2.37    3.66    4.64    6.25    7.82    11.34   16.27
 4     0.71    1.06    1.65    2.20    3.36    4.88    5.99    7.78    9.49    13.28   18.47
 5     1.14    1.61    2.34    3.00    4.35    6.06    7.29    9.24    11.07   15.09   20.52
 6     1.63    2.20    3.07    3.83    5.35    7.23    8.56    10.64   12.59   16.81   22.46
 7     2.17    2.83    3.82    4.67    6.35    8.38    9.80    12.02   14.07   18.48   24.32
 8     2.73    3.49    4.59    5.53    7.34    9.52    11.03   13.36   15.51   20.09   26.12
 9     3.32    4.17    5.38    6.39    8.34    10.66   12.24   14.68   16.92   21.67   27.88
 10    3.94    4.86    6.18    7.27    9.34    11.78   13.44   15.99   18.31   23.21   29.59
P value (Probability)
       0.95    0.90    0.80    0.70    0.50    0.30    0.20    0.10    0.05    0.01    0.001
(p-values from 0.95 down to 0.10 are regarded as nonsignificant; 0.05 and below as significant.)
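The table entries can be reproduced with the survival function (1 − CDF); a minimal sketch using scipy.stats.chi2, with (df, χ²) pairs picked from the 0.05 column of the table.

from scipy import stats

# p-value = probability of a chi-squared statistic at least this extreme = 1 - CDF = survival function
for df, chi2_value in [(1, 3.84), (5, 11.07), (10, 18.31)]:
    p = stats.chi2.sf(chi2_value, df=df)
    print(df, chi2_value, round(p, 3))   # each prints approximately 0.05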


History
This distribution was first described by the German statistician Helmert.

References
[1] M.A. Sanders. "Characteristic function of the central chi-squared distribution" (http:/ / www. planetmathematics. com/ CentralChiDistr. pdf). . Retrieved 2009-03-06. [2] Abramowitz, Milton; Stegun, Irene A., eds. (1965), "Chapter 26" (http:/ / www. math. sfu. ca/ ~cbm/ aands/ page_940. htm), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover, pp.940, ISBN978-0486612720, MR0167642, . [3] NIST (2006). Engineering Statistics Handbook - Chi-Squared Distribution (http:/ / www. itl. nist. gov/ div898/ handbook/ eda/ section3/ eda3666. htm) [4] Jonhson, N.L.; S. Kotz, , N. Balakrishnan (1994). Continuous Univariate Distributions (Second Ed., Vol. 1, Chapter 18). John Willey and Sons. ISBN0-471-58495-9. [5] Mood, Alexander; Franklin A. Graybill, Duane C. Boes (1974). Introduction to the Theory of Statistics (Third Edition, p. 241-246). McGraw-Hill. ISBN0-07-042864-6. [6] Dasgupta, Sanjoy D. A.; Gupta, Anupam K. (2002). "An Elementary Proof of a Theorem of Johnson and Lindenstrauss" (http:/ / cseweb. ucsd. edu/ ~dasgupta/ papers/ jl. pdf). Random Structures and Algorithms 22: 60-65. . Retrieved 2012-05-01. [7] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu. edu. cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics (Elsevier): 219230. . Retrieved 2011-06-02. [8] Chi-squared distribution (http:/ / mathworld. wolfram. com/ Chi-SquaredDistribution. html), from MathWorld, retrieved Feb. 11, 2009 [9] M. K. Simon, Probability Distributions Involving Gaussian Random Variables, New York: Springer, 2002, eq. (2.35), ISBN 978-0-387-34657-1 [10] Box, Hunter and Hunter. Statistics for experimenters. Wiley. p.46. [11] Wilson, E.B.; Hilferty, M.M. (1931) "The distribution of chi-squared". Proceedings of the National Academy of Sciences, Washington, 17, 684688. [12] Chi-Squared Test (http:/ / www2. lv. psu. edu/ jxm57/ irp/ chisquar. html) Table B.2. Dr. Jacqueline S. McLaughlin at The Pennsylvania State University. In turn citing: R.A. Fisher and F. Yates, Statistical Tables for Biological Agricultural and Medical Research, 6th ed., Table IV

External links
Hazewinkel, Michiel, ed. (2001), "Chi-squared distribution" (http://www.encyclopediaofmath.org/index. php?title=p/c022100), Encyclopedia of Mathematics, Springer, ISBN978-1-55608-010-4 Earliest Uses of Some of the Words of Mathematics: entry on Chi squared has a brief history (http://jeff560. tripod.com/c.html) Course notes on Chi-Squared Goodness of Fit Testing (http://www.stat.yale.edu/Courses/1997-98/101/chigf. htm) from Yale University Stats 101 class. Mathematica demonstration showing the chi-squared sampling distribution of various statistics, e.g. x, for a normal population (http://demonstrations.wolfram.com/StatisticsAssociatedWithNormalSamples/) Simple algorithm for approximating cdf and inverse cdf for the chi-squared distribution with a pocket calculator (http://www.jstor.org/stable/2348373)

F-distribution


Fisher-Snedecor (F) distribution

[Plots: probability density function and cumulative distribution function.]

Parameters: d1, d2 > 0 degrees of freedom
Support: x ∈ [0, +∞)
PDF and CDF: see text below
Mean: d2/(d2 − 2), for d2 > 2
Mode: ((d1 − 2)/d1) (d2/(d2 + 2)), for d1 > 2
Variance: 2 d2² (d1 + d2 − 2) / (d1 (d2 − 2)² (d2 − 4)), for d2 > 4
Skewness: defined for d2 > 6
Ex. kurtosis: see text
MGF: does not exist; raw moments defined in text and in [1][2]
CF: see text

In probability theory and statistics, the F-distribution is a continuous probability distribution.[1][2][3][4] It is also known as Snedecor's F distribution or the Fisher-Snedecor distribution (after R.A. Fisher and George W. Snedecor). The F-distribution arises frequently as the null distribution of a test statistic, most notably in the analysis of variance; see F-test.

Definition
If a random variable X has an F-distribution with parameters d1 and d2, we write X ~ F(d1, d2). Then the probability density function for X is given by

f(x; d1, d2) = sqrt( (d1 x)^(d1) d2^(d2) / (d1 x + d2)^(d1 + d2) ) / ( x B(d1/2, d2/2) )

for real x ≥ 0. Here B is the beta function. In many applications, the parameters d1 and d2 are positive integers, but the distribution is well-defined for positive real values of these parameters.

The cumulative distribution function is

F(x; d1, d2) = I_{d1 x / (d1 x + d2)}(d1/2, d2/2),

where I is the regularized incomplete beta function. The expectation, variance, and other details about the F(d1, d2) distribution are given in the summary box above; the excess kurtosis exists for d2 > 8. The k-th moment of an F(d1, d2) distribution exists and is finite only when 2k < d2, and it is equal to[5]

E[X^k] = (d2/d1)^k Γ(d1/2 + k) Γ(d2/2 − k) / ( Γ(d1/2) Γ(d2/2) ).

The F-distribution is a particular parametrization of the beta prime distribution, which is also called the beta distribution of the second kind.

The characteristic function is listed incorrectly in many standard references (e.g., [2]). The correct expression[6] is

φ(s) = ( Γ((d1 + d2)/2) / Γ(d2/2) ) U( d1/2, 1 − d2/2, −i (d2/d1) s ),

where U(a, b, z) is the confluent hypergeometric function of the second kind.


Characterization
A random variate of the F-distribution with parameters d1 and d2 arises as the ratio of two appropriately scaled chi-squared variates:

X = (U1/d1) / (U2/d2),

where U1 and U2 have chi-squared distributions with d1 and d2 degrees of freedom respectively, and U1 and U2 are independent. In instances where the F-distribution is used, for instance in the analysis of variance, independence of U1 and U2 might be demonstrated by applying Cochran's theorem.
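A quick simulation of this characterization (seed, sample size, and degrees of freedom are arbitrary choices): the ratio of two independent scaled chi-squared variates is compared with SciPy's F-distribution.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
d1, d2, n = 4, 12, 100_000

u1 = rng.chisquare(d1, size=n)
u2 = rng.chisquare(d2, size=n)
f = (u1 / d1) / (u2 / d2)              # F(d1, d2) variates by construction

print(stats.kstest(f, stats.f(dfn=d1, dfd=d2).cdf).statistic)  # small => consistent with F(d1, d2)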

Generalization
A generalization of the (central) F-distribution is the noncentral F-distribution.

Related distributions and properties


If X ~ χ²(d1) and Y ~ χ²(d2) are independent, then (X/d1)/(Y/d2) ~ F(d1, d2).
If X ~ F(d1, d2), then d1X/(d1X + d2) has a Beta(d1/2, d2/2) distribution (Beta distribution). Equivalently, if X ~ Beta(d1/2, d2/2), then d2X/(d1(1 − X)) ~ F(d1, d2).
If X ~ F(d1, d2), then the limit of d1X as d2 → ∞ has the chi-squared distribution χ²(d1).
F(d1, d2) is equivalent to the scaled Hotelling's T-squared distribution.
If X ~ F(d1, d2), then 1/X ~ F(d2, d1).
If X has a Student's t-distribution with n degrees of freedom, then X² ~ F(1, n) and X⁻² ~ F(n, 1).
The F-distribution is a special case of type 6 Pearson distribution.
If X and Y are independent, with X, Y ~ Laplace(μ, b), then |X − μ|/|Y − μ| ~ F(2, 2) (Laplace distribution).
If X ~ F(n, m), then (1/2) ln X has a Fisher's z-distribution.
The noncentral F-distribution simplifies to the F-distribution if the noncentrality parameter is zero.
The doubly noncentral F-distribution simplifies to the F-distribution if both noncentrality parameters are zero.
If x is the quantile p for X ~ F(d1, d2) and y is the quantile 1 − p for Y ~ F(d2, d1), then x = 1/y.


References
[1] Johnson, Norman Lloyd; Samuel Kotz, N. Balakrishnan (1995). Continuous Univariate Distributions, Volume 2 (Second Edition, Section 27). Wiley. ISBN0-471-58494-0. [2] Abramowitz, Milton; Stegun, Irene A., eds. (1965), "Chapter 26" (http:/ / www. math. sfu. ca/ ~cbm/ aands/ page_946. htm), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Dover, pp.946, ISBN978-0486612720, MR0167642, . [3] NIST (2006). Engineering Statistics Handbook - F Distribution (http:/ / www. itl. nist. gov/ div898/ handbook/ eda/ section3/ eda3665. htm) [4] Mood, Alexander; Franklin A. Graybill, Duane C. Boes (1974). Introduction to the Theory of Statistics (Third Edition, p. 246-249). McGraw-Hill. ISBN0-07-042864-6. [5] Taboga, Marco. "The F distribution" (http:/ / www. statlect. com/ F_distribution. htm). . [6] Phillips, P. C. B. (1982) "The true characteristic function of the F distribution," Biometrika, 69: 261-264 JSTOR2335882

External links
Table of critical values of the F-distribution (http://www.itl.nist.gov/div898/handbook/eda/section3/ eda3673.htm) Earliest Uses of Some of the Words of Mathematics: entry on F-distribution contains a brief history (http:// jeff560.tripod.com/f.html)

Log-normal distribution


[Plots: probability density function (some log-normal density functions with identical location parameter μ but differing scale parameters σ) and cumulative distribution function of the log-normal distribution (with μ = 0).]

Notation: ln N(μ, σ²)
Parameters: σ² > 0 shape (real), μ ∈ R log-scale
Support: x ∈ (0, +∞)
PDF and CDF: see text below
Mean: e^(μ + σ²/2)
Median: e^μ
Mode: e^(μ − σ²)
Variance: (e^(σ²) − 1) e^(2μ + σ²)
Skewness: (e^(σ²) + 2) √(e^(σ²) − 1)
Ex. kurtosis: e^(4σ²) + 2e^(3σ²) + 3e^(2σ²) − 6
Fisher information: see text
MGF: defined only on the negative half-axis, see text
CF: representation is asymptotically divergent but sufficient for numerical purposes

In probability theory, a log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. If X is a random variable with a normal distribution, then Y = exp(X) has a log-normal distribution; likewise, if Y is log-normally distributed, then X = log(Y) has a normal distribution. The log-normal distribution is the distribution of a random variable that takes only positive real values. Log-normal is also written log normal or lognormal. The distribution is occasionally referred to as the Galton distribution or Galton's distribution, after Francis Galton,[1] and other names such as McAlister, Gibrat and Cobb-Douglas have been associated with it.[1] A variable might be modeled as log-normal if it can be thought of as the multiplicative product of many independent random variables each of which is positive. (This is justified by considering the central limit theorem in the log-domain.) For example, in finance, the variable could represent the compound return from a sequence of many trades (each expressed as its return + 1); or a long-term discount factor can be derived from the product of short-term discount factors. In wireless communication, the attenuation caused by shadowing or slow fading from random objects is often assumed to be log-normally distributed: see log-distance path loss model. The log-normal distribution is the maximum entropy probability distribution for a random variate X for which the mean and variance of ln(X) are fixed.[2]

μ and σ

In a log-normal distribution X, the parameters denoted μ and σ are, respectively, the mean and standard deviation of the variable's natural logarithm (by definition, the variable's logarithm is normally distributed), which means

X = e^(μ + σZ)

with Z a standard normal variable. This relationship is true regardless of the base of the logarithmic or exponential function. If log_a(Y) is normally distributed, then so is log_b(Y), for any two positive numbers a, b ≠ 1. Likewise, if e^X is log-normally distributed, then so is a^X, where a is a positive number ≠ 1.

On a logarithmic scale, μ and σ can be called the location parameter and the scale parameter, respectively. In contrast, the mean and standard deviation of the non-logarithmized sample values are denoted m and s.d. in this article.


Characterization
Probability density function
The probability density function of a log-normal distribution is:[1]

f_X(x; μ, σ) = (1 / (x σ √(2π))) exp( −(ln x − μ)² / (2σ²) ),   x > 0.

This follows by applying the change-of-variables rule on the density function of a normal distribution.

Cumulative distribution function


The cumulative distribution function is

F_X(x; μ, σ) = (1/2) erfc( −(ln x − μ) / (σ√2) ) = Φ( (ln x − μ) / σ ),

where erfc is the complementary error function, and Φ is the cumulative distribution function of the standard normal distribution.

Characteristic function and moment generating function


The characteristic function, E[e^(itX)], has a number of representations. The integral itself converges for Im(t) ≤ 0. The simplest representation is obtained by Taylor expanding e^(itX) and using the formula for moments below, giving

φ(t) = Σ_{n=0}^∞ ( (it)^n / n! ) e^(nμ + n²σ²/2).

This series representation is divergent for Re(σ²) > 0. However, it is sufficient for evaluating the characteristic function numerically at positive σ as long as the upper limit in the sum above is kept bounded, n ≤ N, for a suitable bound N and σ² < 0.1. To bring the numerical values of the parameters μ, σ into the domain where the strong inequality holds true one could use the fact that if X is log-normally distributed then X^m is also log-normally distributed with parameters mμ, mσ. The inequality can then be satisfied for sufficiently small m. The sum of the series first converges to the value of φ(t) with arbitrarily high accuracy if m is small enough, and the left part of the strong inequality is satisfied. If a considerably larger number of terms is taken into account, the sum eventually diverges when the right part of the strong inequality is no longer valid.

Another useful representation is available[3][4] by means of a double Taylor expansion of e^((ln x)²/(2σ²)).

The moment-generating function for the log-normal distribution does not exist on the domain R, but only exists on the half-interval (−∞, 0].

Properties
Location and scale
For the log-normal distribution, the location and scale properties of the distribution are more readily treated using the geometric mean and geometric standard deviation than the arithmetic mean and standard deviation.

Geometric moments

The geometric mean of the log-normal distribution is e^μ. Because the log of a log-normal variable is symmetric and quantiles are preserved under monotonic transformations, the geometric mean of a log-normal distribution is equal to its median.[5]

The geometric mean (m_g) can alternatively be derived from the arithmetic mean (m_a) in a log-normal distribution by:

m_g = m_a e^(−σ²/2).

The geometric standard deviation is equal to e^σ.

Arithmetic moments

If X is a log-normally distributed variable, its expected value (E, the arithmetic mean), variance (Var), and standard deviation (s.d.) are

E[X] = e^(μ + σ²/2),
Var[X] = (e^(σ²) − 1) e^(2μ + σ²),
s.d.[X] = √Var[X] = e^(μ + σ²/2) √(e^(σ²) − 1).

Equivalently, parameters μ and σ can be obtained if the expected value and variance are known:

μ = ln(E[X]) − (1/2) ln(1 + Var[X]/E[X]²),
σ² = ln(1 + Var[X]/E[X]²).

For any real or complex number s, the s-th moment of a log-normal X is given by[1]

E[X^s] = e^(sμ + s²σ²/2).

A log-normal distribution is not uniquely determined by its moments E[X^k] for k ≥ 1, that is, there exists some other distribution with the same moments for all k.[1] In fact, there is a whole family of distributions with the same moments as the log-normal distribution.

Mode and median


The mode is the point of global maximum of the probability density function. In particular, it solves the equation (ln f)′ = 0:

Mode[X] = e^(μ − σ²).

The median is such a point where F_X = 1/2:

Med[X] = e^μ.

Coefficient of variation
The coefficient of variation is the ratio s.d. over m (on the natural scale) and is equal to:

√(e^(σ²) − 1).

Partial expectation

The partial expectation of a random variable X with respect to a threshold k is defined as g(k) = E[X | X > k] P[X > k]. For a log-normal random variable the partial expectation is given by

g(k) = e^(μ + σ²/2) Φ( (μ + σ² − ln k) / σ ),

where Φ is the standard normal cumulative distribution function.

[Figure: comparison of mean, median and mode of two log-normal distributions with different skewness.]

This formula has applications in insurance and economics; it is used in solving the partial differential equation leading to the Black-Scholes formula.

Other
A set of data that arises from the log-normal distribution has a symmetric Lorenz curve (see also Lorenz asymmetry coefficient).[6]

The harmonic (H), geometric (G) and arithmetic (A) means of this distribution are related;[7] the relation is given by

G² = A · H.

Log-normal distributions are infinitely divisible.[1]

Occurrence
In biology, variables whose logarithms tend to have a normal distribution include:

Measures of size of living tissue (length, skin area, weight);[8]
The length of inert appendages (hair, claws, nails, teeth) of biological specimens, in the direction of growth;
Certain physiological measurements, such as blood pressure of adult humans (after separation on male/female subpopulations).[9]

Consequently, reference ranges for measurements in healthy individuals are more accurately estimated by assuming a log-normal distribution than by assuming a symmetric distribution about the mean.

In hydrology, the log-normal distribution is used to analyze extreme values of such variables as monthly and annual maximum values of daily rainfall and river discharge volumes.[10] The image on the right illustrates an example of fitting the log-normal distribution to ranked annually maximum one-day rainfalls, showing also the 90% confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part of a cumulative frequency analysis. [Figure: fitted cumulative log-normal distribution to annually maximum 1-day rainfalls.]

In economics, there is evidence that the income of 97%–99% of the population is distributed log-normally.[11]

In finance, in particular the Black-Scholes model, changes in the logarithm of exchange rates, price indices, and stock market indices are assumed normal[12] (these variables behave like compound interest, not like simple interest, and so are multiplicative). However, some mathematicians such as Benoît Mandelbrot have argued that log-Lévy distributions, which possess heavy tails, would be a more appropriate model, in particular for the analysis of stock market crashes. Indeed, stock price distributions typically exhibit a fat tail.[13]

The distribution of city sizes is lognormal. This follows from Gibrat's law of proportionate (or scale-free) growth. Irrespective of their size, all cities follow the same stochastic growth process. As a result, the logarithm of city size is normally distributed. There is also evidence of lognormality in the firm size distribution and of Gibrat's law.

In reliability analysis, the lognormal distribution is often used to model times to repair a maintainable system.

In wireless communication, "the local-mean power expressed in logarithmic values, such as dB or neper, has a normal (i.e., Gaussian) distribution."[14]

It has been proposed that coefficients of friction and wear may be treated as having a lognormal distribution.[15]


Maximum likelihood estimation of parameters


For determining the maximum likelihood estimators of the log-normal distribution parameters μ and σ, we can use the same procedure as for the normal distribution. To avoid repetition, we observe that

f_L(x; μ, σ) = (1/x) f_N(ln x; μ, σ),

where by f_L we denote the probability density function of the log-normal distribution and by f_N that of the normal distribution. Therefore, using the same indices to denote distributions, we can write the log-likelihood function thus:

ℓ_L(μ, σ | x_1, ..., x_n) = − Σ_k ln x_k + ℓ_N(μ, σ | ln x_1, ..., ln x_n).

Since the first term is constant with regard to μ and σ, both logarithmic likelihood functions, ℓ_L and ℓ_N, reach their maximum with the same μ and σ. Hence, using the formulas for the normal distribution maximum likelihood parameter estimators and the equality above, we deduce that for the log-normal distribution it holds that

μ̂ = (1/n) Σ_k ln x_k,   σ̂² = (1/n) Σ_k (ln x_k − μ̂)².
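A minimal sketch of these estimators on simulated data (the true parameters, seed, and sample size are arbitrary): the MLEs are the sample mean and the uncorrected sample standard deviation of the logs.

import numpy as np

rng = np.random.default_rng(3)
mu_true, sigma_true = 1.5, 0.4
x = rng.lognormal(mean=mu_true, sigma=sigma_true, size=50_000)

log_x = np.log(x)
mu_hat = log_x.mean()                                  # MLE of mu
sigma_hat = np.sqrt(((log_x - mu_hat) ** 2).mean())    # MLE of sigma (divide by n, not n-1)

print(mu_hat, sigma_hat)                               # close to 1.5 and 0.4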

Multivariate log-normal
If X ~ N(μ, Σ) is a multivariate normal distribution, then Y = exp(X) has a multivariate log-normal distribution[16] with mean

E[Y]_i = e^(μ_i + Σ_ii/2),   i = 1, ..., n,

and covariance matrix

Var[Y]_ij = e^(μ_i + μ_j + (Σ_ii + Σ_jj)/2) ( e^(Σ_ij) − 1 ).

Generating log-normally distributed random variates


Given a random variate Z drawn from the normal distribution with 0 mean and 1 standard deviation, then the variate

X = e^(μ + σZ)

has a log-normal distribution with parameters μ and σ.
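This recipe in code form (a sketch; the parameter values are arbitrary): exponentiate an affine transform of a standard normal draw.

import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 0.5, 0.75

z = rng.standard_normal(10_000)   # Z ~ N(0, 1)
x = np.exp(mu + sigma * z)        # X = exp(mu + sigma * Z) ~ log-normal(mu, sigma^2)

print(np.log(x).mean(), np.log(x).std())  # approximately mu and sigma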

Related distributions
If X ~ N(μ, σ²) is a normal distribution, then e^X ~ ln N(μ, σ²).
If X ~ ln N(μ, σ²) is distributed log-normally, then ln X ~ N(μ, σ²) is a normal random variable.
If X_j ~ ln N(μ_j, σ_j²) are n independent log-normally distributed variables, and Y = Π_j X_j, then Y is also distributed log-normally:

Y ~ ln N( Σ_j μ_j, Σ_j σ_j² ).

Let X_j ~ ln N(μ_j, σ_j²) be independent log-normally distributed variables with possibly varying σ and μ parameters, and Y = Σ_j X_j. The distribution of Y has no closed-form expression, but can be reasonably approximated by another log-normal distribution Z at the right tail. Its probability density function at the neighborhood of 0 has been characterized[17] and it does not resemble any log-normal distribution. A commonly used approximation (due to Fenton and Wilkinson) is obtained by matching the mean and variance:

σ_Z² = ln[ 1 + Σ_j e^(2μ_j + σ_j²) (e^(σ_j²) − 1) / ( Σ_j e^(μ_j + σ_j²/2) )² ],
μ_Z = ln[ Σ_j e^(μ_j + σ_j²/2) ] − σ_Z²/2.

In the case that all X_j have the same variance parameter σ_j = σ, these formulas simplify to

σ_Z² = ln[ 1 + (e^(σ²) − 1) Σ_j e^(2μ_j) / ( Σ_j e^(μ_j) )² ],
μ_Z = ln[ Σ_j e^(μ_j) ] + σ²/2 − σ_Z²/2.

If X ~ ln N(μ, σ²), then X + c is said to have a shifted log-normal distribution with support x ∈ (c, +∞); E[X + c] = E[X] + c, Var[X + c] = Var[X].
If X ~ ln N(μ, σ²), then aX ~ ln N(μ + ln a, σ²) for a > 0.
If X ~ ln N(μ, σ²), then 1/X ~ ln N(−μ, σ²).
If X ~ ln N(μ, σ²), then X^a ~ ln N(aμ, a²σ²) for a ≠ 0.
The lognormal distribution is a special case of the semi-bounded Johnson distribution.
If X | Y ~ Rayleigh(Y) with Y ~ ln N(μ, σ²), then X has a Suzuki distribution.

Similar distributions
A substitute for the log-normal whose integral can be expressed in terms of more elementary functions (Swamee, 2002) can be obtained based on the logistic distribution; the resulting CDF is that of a log-logistic distribution.

Notes
[1] Johnson, Norman L.; Kotz, Samuel; Balakrishnan, N. (1994), "14: Lognormal Distributions", Continuous univariate distributions. Vol. 1, Wiley Series in Probability and Mathematical Statistics: Applied Probability and Statistics (2nd ed.), New York: John Wiley & Sons, ISBN978-0-471-58495-7, MR1299979 [2] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http:/ / www. wise. xmu. edu. cn/ Master/ Download/ . . \. . \UploadFiles\paper-masterdownload\2009519932327055475115776. pdf). Journal of Econometrics (Elsevier): 219230. . Retrieved 2011-06-02. [3] Leipnik, Roy B. (1991), "On Lognormal Random Variables: I The Characteristic Function", Journal of the Australian Mathematical Society Series B, 32, 327347. [4] Daniel Dufresne (2009), SUMS OF LOGNORMALS (http:/ / www. soa. org/ library/ proceedings/ arch/ 2009/ arch-2009-iss1-dufresne. pdf,), Centre for Actuarial Studies, University of Melbourne. [5] Leslie E. Daly, Geoffrey Joseph Bourke (2000) Interpretation and uses of medical statistics (http:/ / books. google. se/ books?id=AY7LnYkiLNkC& pg=PA89) Edition: 5. Wiley-Blackwell ISBN 0-632-04763-1, ISBN 978-0-632-04763-5 (page 89) [6] Damgaard, Christian; Weiner, Jacob (2000). "Describing inequality in plant size or fecundity". Ecology 81 (4): 11391142. doi:10.1890/0012-9658(2000)081[1139:DIIPSO]2.0.CO;2. [7] Rossman LA (1990) "Design stream ows based on harmonic means". J Hydraulic Engineering ASCE 116 (7) 946950 [8] Huxley, Julian S. (1932). Problems of relative growth. London. ISBN0-486-61114-0. OCLC476909537. [9] Makuch, Robert W.; D.H. Freeman, M.F. Johnson (1979). "Justification for the lognormal distribution as a model for blood pressure" (http:/ / www. sciencedirect. com/ science/ article/ pii/ 0021968179900705). Journal of Chronic Diseases 32 (3): 245250. doi:10.1016/0021-9681(79)90070-5. (. . Retrieved 27 February 2012. [10] Ritzema (ed.), H.P. (1994). Frequency and Regression Analysis (http:/ / www. waterlog. info/ pdf/ freqtxt. pdf). Chapter 6 in: Drainage Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. pp.175224. ISBN90-70754-33-9. .

[11] Clementi, F.; Gallegati, M. (2005) "Pareto's law of income distribution: Evidence for Germany, the United Kingdom, and the United States" (http:/ / ideas. repec. org/ p/ wpa/ wuwpmi/ 0505006. html), EconWPA [12] Black, Fischer and Myron Scholes, "The Pricing of Options and Corporate Liabilities", Journal of Political Economy, Vol. 81, No. 3, (May/June 1973), pp. 637654. [13] Bunchen, P., Advanced Option Pricing, University of Sydney coursebook, 2007 [14] http:/ / wireless. per. nl/ reference/ chaptr03/ shadow/ shadow. htm [15] Steele, C. (2008). "Use of the lognormal distribution for the coefficients of friction and wear". Reliability Engineering & System Safety 93 (10): 15742013. doi:10.1016/j.ress.2007.09.005. [16] Tarmast, Ghasem (2001) "Multivariate LogNormal Distribution" (http:/ / isi. cbs. nl/ iamamember/ CD2/ pdf/ 329. PDF) ISI Proceedings: Seoul 53rd Session 2001 [17] Gao, X.; Xu, H; Ye, D. (2009), "Asymptotic Behaviors of Tail Density for Sum of Correlated Lognormal Variables" (http:/ / www. hindawi. com/ journals/ ijmms/ 2009/ 630857. html). International Journal of Mathematics and Mathematical Sciences, vol. 2009, Article ID 630857. doi:10.1155/2009/630857


References
Aitchison, J. and Brown, J.A.C. (1957) The Lognormal Distribution, Cambridge University Press. E. Limpert, W. Stahel and M. Abbt (2001) Log-normal Distributions across the Sciences: Keys and Clues (http:// stat.ethz.ch/~stahel/lognormal/bioscience.pdf), BioScience, 51 (5), 341352. Eric W. Weisstein et al. Log Normal Distribution (http://mathworld.wolfram.com/LogNormalDistribution. html) at MathWorld. Electronic document, retrieved October 26, 2006. Swamee, P.K. (2002). Near Lognormal Distribution (http://ascelibrary.org/doi/abs/10.1061/ (ASCE)1084-0699(2002)7:6(441)), Journal of Hydrologic Engineering. 7 (6): 441444 Swamee, P. K. (2002). "Near Lognormal Distribution". Journal of Hydrologic Engineering 7 (6): 441000. doi:10.1061/(ASCE)1084-0699(2002)7:6(441). Holgate, P. (1989). "The lognormal characteristic function". Communications in Statistics - Theory and Methods 18 (12): 45394548. doi:10.1080/03610928908830173.

Further reading
Robert Brooks, Jon Corson, and J. Donal Wales. "The Pricing of Index Options When the Underlying Assets All Follow a Lognormal Diffusion" (http://papers.ssrn.com/sol3/papers.cfm?abstract_id=5735), in Advances in Futures and Options Research, volume 7, 1994.

Exponential distribution


[Plots: probability density function and cumulative distribution function.]

Parameters: λ > 0 rate, or inverse scale
Support: x ∈ [0, ∞)
PDF: λ e^(−λx)
CDF: 1 − e^(−λx)
Mean: 1/λ
Median: ln(2)/λ
Mode: 0
Variance: 1/λ²
Skewness: 2
Ex. kurtosis: 6
Entropy: 1 − ln(λ)
MGF: λ/(λ − t), for t < λ
CF: λ/(λ − it)

In probability theory and statistics, the exponential distribution (a.k.a. negative exponential distribution) is a family of continuous probability distributions. It describes the time between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate. It is the continuous analogue of the geometric distribution. Note that the exponential distribution is not the same as the class of exponential families of distributions, which is a large class of probability distributions that includes the exponential distribution as one of its members, but also includes the normal distribution, binomial distribution, gamma distribution, Poisson, and many others.


Characterization
Probability density function
The probability density function (pdf) of an exponential distribution is

f(x; λ) = λ e^(−λx) for x ≥ 0, and f(x; λ) = 0 for x < 0.

Alternatively, this can be defined using the Heaviside step function, H(x):

f(x; λ) = λ e^(−λx) H(x).

Here λ > 0 is the parameter of the distribution, often called the rate parameter. The distribution is supported on the interval [0, ∞). If a random variable X has this distribution, we write X ~ Exp(λ). The exponential distribution exhibits infinite divisibility.

Cumulative distribution function


The cumulative distribution function is given by

F(x; λ) = 1 − e^(−λx) for x ≥ 0, and F(x; λ) = 0 for x < 0.

Alternatively, this can be defined using the Heaviside step function, H(x):

F(x; λ) = (1 − e^(−λx)) H(x).

Alternative parameterization
A commonly used alternative parameterization is to define the probability density function (pdf) of an exponential distribution as

f(x; β) = (1/β) e^(−x/β),   x ≥ 0,

where β > 0 is a scale parameter of the distribution and is the reciprocal of the rate parameter, λ, defined above. In this specification, β is a survival parameter in the sense that if a random variable X is the duration of time that a given biological or mechanical system manages to survive and X ~ Exponential(β), then E[X] = β. That is to say, the expected duration of survival of the system is β units of time. The parameterisation involving the "rate" parameter arises in the context of events arriving at a rate λ, when the time between events (which might be modelled using an exponential distribution) has a mean of β = 1/λ.

The alternative specification is sometimes more convenient than the one given above, and some authors will use it as a standard definition. This alternative specification is not used here. Unfortunately this gives rise to a notational ambiguity. In general, the reader must check which of these two specifications is being used if an author writes "X ~ Exponential(λ)", since either the rate notation of the previous section (using λ) or the scale notation of this section (here, using β to avoid confusion) could be intended.


Properties
Mean, variance, moments and median
The mean or expected value of an exponentially distributed random variable X with rate parameter λ is given by

E[X] = 1/λ.

In light of the examples given above, this makes sense: if you receive phone calls at an average rate of 2 per hour, then you can expect to wait half an hour for every call.

The variance of X is given by

Var[X] = 1/λ².

The moments of X, for n = 1, 2, ..., are given by

E[X^n] = n!/λ^n.

The mean is the probability mass centre, that is, the first moment.

The median of X is given by

m[X] = ln(2)/λ,

where ln refers to the natural logarithm. Thus the absolute difference between the mean and median is

|E[X] − m[X]| = (1 − ln 2)/λ < 1/λ = s.d.[X],

in accordance with the median-mean inequality.

Memorylessness
An important property of the exponential distribution is that it is memoryless. This means that if a random variable T is exponentially distributed, its conditional probability obeys

Pr(T > s + t | T > s) = Pr(T > t)   for all s, t ≥ 0.

This says that the conditional probability that we need to wait, for example, more than another 10 seconds before the first arrival, given that the first arrival has not yet happened after 30 seconds, is equal to the initial probability that we need to wait more than 10 seconds for the first arrival. So, if we waited for 30 seconds and the first arrival didn't happen (T > 30), the probability that we'll need to wait another 10 seconds for the first arrival (T > 30 + 10) is the same as the initial probability that we need to wait more than 10 seconds for the first arrival (T > 10). The fact that Pr(T > 40 | T > 30) = Pr(T > 10) does not mean that the events T > 40 and T > 30 are independent.

To summarize: "memorylessness" of the probability distribution of the waiting time T until the first arrival means

Pr(T > 40 | T > 30) = Pr(T > 10).

It does not mean

Pr(T > 40 | T > 30) = Pr(T > 40).

(That would be independence. These two events are not independent.)

The exponential distributions and the geometric distributions are the only memoryless probability distributions.

The exponential distribution is consequently also necessarily the only continuous probability distribution that has a constant failure rate.

Quantiles
The quantile function (inverse cumulative distribution function) for Exponential(λ) is

F⁻¹(p; λ) = −ln(1 − p)/λ,   0 ≤ p < 1.

The quartiles are therefore:

first quartile: ln(4/3)/λ
median: ln(2)/λ
third quartile: ln(4)/λ

Kullback-Leibler divergence

The directed Kullback-Leibler divergence between Exp(λ0) (the 'true' distribution) and Exp(λ) (the 'approximating' distribution) is given by

Δ(λ0 || λ) = ln(λ0/λ) + λ/λ0 − 1.
Maximum entropy distribution


Among all continuous probability distributions with support [0, ∞) and mean μ, the exponential distribution with λ = 1/μ has the largest entropy. Alternatively, it is the maximum entropy probability distribution for a random variate X for which E[X] is fixed and greater than zero.[1]

Distribution of the minimum of exponential random variables


Let X1, ..., Xn be independent exponentially distributed random variables with rate parameters λ1, ..., λn. Then

min{X1, ..., Xn}

is also exponentially distributed, with parameter

λ = λ1 + ... + λn.

This can be seen by considering the complementary cumulative distribution function:

Pr(min{X1, ..., Xn} > x) = Pr(X1 > x, ..., Xn > x) = Π_{i=1}^n Pr(Xi > x) = Π_{i=1}^n e^(−λi x) = e^(−(λ1 + ... + λn) x).

The index of the variable which achieves the minimum is distributed according to the law

Pr(Xk = min{X1, ..., Xn}) = λk / (λ1 + ... + λn).

Note that max{X1, ..., Xn} is not exponentially distributed.


Parameter estimation
Suppose a given variable is exponentially distributed and the rate parameter is to be estimated.

Maximum likelihood
The likelihood function for λ, given an independent and identically distributed sample x = (x1, ..., xn) drawn from the variable, is

L(λ) = Π_{i=1}^n λ e^(−λ xi) = λ^n exp(−λ n x̄),

where

x̄ = (1/n) Σ_{i=1}^n xi

is the sample mean. The derivative of the likelihood function's logarithm is

d/dλ ln L(λ) = n/λ − n x̄.

Consequently the maximum likelihood estimate for the rate parameter is

λ̂ = 1/x̄.

While this estimate is the most likely reconstruction of the true parameter λ, it is only an estimate, and as such, one can imagine that the more data points are available the better the estimate will be. It so happens that one can compute an exact confidence interval, that is, a confidence interval that is valid for all numbers of samples, not just large ones. The 100(1 − α)% exact confidence interval for this estimate is given by[2]

χ²_{1−α/2, 2n} / (2 n x̄) < λ < χ²_{α/2, 2n} / (2 n x̄),

which is also equal to:

λ̂ χ²_{1−α/2, 2n} / (2n) < λ < λ̂ χ²_{α/2, 2n} / (2n),

where λ̂ is the MLE estimate, λ is the true value of the parameter, and χ²_{p, ν} is the 100(1 − p) percentile of the chi-squared distribution with ν degrees of freedom.
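A sketch of the estimate and the exact interval using SciPy's chi-squared quantiles (the simulated data and seed are arbitrary); note that the percentile convention above is the upper tail, so χ²_{p, ν} corresponds to chi2.ppf(1 − p, ν).

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
lam_true, n = 2.0, 40
x = rng.exponential(scale=1.0 / lam_true, size=n)

lam_hat = 1.0 / x.mean()                       # maximum likelihood estimate of the rate
alpha = 0.05                                   # for a 95% interval

# chi2.ppf is the lower-tail quantile, i.e. the 100(1 - p) percentile in the text's convention.
lower = stats.chi2.ppf(alpha / 2, 2 * n) / (2 * n * x.mean())
upper = stats.chi2.ppf(1 - alpha / 2, 2 * n) / (2 * n * x.mean())

print(lam_hat, (lower, upper))                 # exact confidence interval for lambda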

Bayesian inference
The conjugate prior for the exponential distribution is the gamma distribution (of which the exponential distribution is a special case). The following parameterization of the gamma pdf is useful:

Gamma(λ; α, β) = (β^α / Γ(α)) λ^(α−1) exp(−λβ).

The posterior distribution p can then be expressed in terms of the likelihood function defined above and a gamma prior:

p(λ) ∝ L(λ) Gamma(λ; α, β) = λ^n exp(−λ n x̄) (β^α / Γ(α)) λ^(α−1) exp(−λβ) ∝ λ^(α+n−1) exp(−λ (β + n x̄)).


Now the posterior density p has been specified up to a missing normalizing constant. Since it has the form of a gamma pdf, this can easily be filled in, and one obtains

p(λ) = Gamma(λ; α + n, β + n x̄).

Here the parameter α can be interpreted as the number of prior observations, and β as the sum of the prior observations.

Confidence interval
A simple and rapid method to calculate an approximate confidence interval for the estimation of λ is based on the application of the central limit theorem.[3] This method provides a good approximation of the confidence interval limits for samples containing at least 15–20 elements. Denoting by N the sample size, the upper and lower limits of the 95% confidence interval are given by:

λ_lower = λ̂ (1 − 1.96/√N),
λ_upper = λ̂ (1 + 1.96/√N).

Generating exponential variates


A conceptually very simple method for generating exponential variates is based on inverse transform sampling: given a random variate U drawn from the uniform distribution on the unit interval (0, 1), the variate

T = F⁻¹(U)

has an exponential distribution, where F⁻¹ is the quantile function, defined by

F⁻¹(p) = −ln(1 − p)/λ.

Moreover, if U is uniform on (0, 1), then so is 1 − U. This means one can generate exponential variates as follows:

T = −ln(U)/λ.

Other methods for generating exponential variates are discussed by Knuth[4] and Devroye.[5] The ziggurat algorithm is a fast method for generating exponential variates. A fast method for generating a set of ready-ordered exponential variates without using a sorting routine is also available.[5]
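The same inverse-transform recipe in code (a sketch; the rate and sample size are arbitrary choices).

import numpy as np

rng = np.random.default_rng(6)
lam = 1.5

u = rng.uniform(0.0, 1.0, size=100_000)  # U ~ Uniform(0, 1), values in [0, 1)
t = -np.log(1.0 - u) / lam               # T = F^-1(U) = -ln(1 - U)/lambda ~ Exp(lambda)

print(t.mean(), t.var())                 # approximately 1/lambda and 1/lambda^2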

Related distributions
Exponential distribution is closed under scaling by a positive factor: if X ~ Exp(λ) and c > 0, then cX ~ Exp(λ/c).
The Benktander Weibull distribution reduces to a truncated exponential distribution.
The exponential distribution is a limit of a scaled beta distribution: lim_{n→∞} n Beta(1, n) = Exp(1).
If X ~ Exp(λ) and Y ~ Exp(λ) are independent, then X − Y has a Laplace distribution.
If X1, ..., Xk ~ Exp(λ) are independent, then X1 + ... + Xk ~ Erlang(k, λ) (Erlang distribution); equivalently, the sum has a Gamma(k, λ) distribution (gamma distribution).
If X ~ Exp(1), then μ − β ln X has a Gumbel distribution, a special case of the Generalized extreme value distribution.
If X ~ Exp(λ) and Y ~ Exp(λ) are independent, then ln(X/Y) has a logistic distribution; see also the skew-logistic distribution.
If X ~ Exp(λ), then x_m e^X ~ Pareto(x_m, λ) (Pareto distribution), a power law.
Exponential distribution is a special case of type 3 Pearson distribution.
If X ~ Exp(λ), then e^(−λX) has a Uniform distribution (continuous) on (0, 1).
If the times between random events are i.i.d. Exp(λ), then the number of events in a time interval of length t has a Poisson distribution with parameter λt.
If X ~ Exp(λ), then ⌊X⌋ has a geometric distribution with success probability 1 − e^(−λ), where ⌊·⌋ is the floor function.
If X ~ Exp(1) and Y is an independent gamma variate, then the product XY gives rise to the K-distribution.
If X ~ Exp(λ), then √X has a Rayleigh distribution and X^(1/k) has a Weibull distribution with shape parameter k.
The Hoyt distribution can be obtained from the Exponential distribution and the Arcsine distribution.
Y ~ Gumbel(μ, β), i.e. Y has a Gumbel distribution, if Y = μ − β log(X) and X ~ Exponential(1).
X ~ χ²(2), i.e. X has a chi-squared distribution with 2 degrees of freedom, if X ~ Exponential(1/2).
Let X ~ Exponential(λ_X) and Y ~ Exponential(λ_Y) be independent. Then λ_X X / (λ_Y Y) has probability density function f(z) = 1/(z + 1)². This can be used to obtain a confidence interval for λ_X/λ_Y.

Other related distributions:
Hyper-exponential distribution: the distribution whose density is a weighted sum of exponential densities.
Hypoexponential distribution: the distribution of a general sum of exponential random variables.
exGaussian distribution: the sum of an exponential distribution and a normal distribution.


Applications
Occurrence of events
The exponential distribution occurs naturally when describing the lengths of the inter-arrival times in a homogeneous Poisson process. The exponential distribution may be viewed as a continuous counterpart of the geometric distribution, which describes the number of Bernoulli trials necessary for a discrete process to change state. In contrast, the exponential distribution describes the time for a continuous process to change state.

In real-world scenarios, the assumption of a constant rate (or probability per unit time) is rarely satisfied. For example, the rate of incoming phone calls differs according to the time of day. But if we focus on a time interval during which the rate is roughly constant, such as from 2 to 4 p.m. during work days, the exponential distribution can be used as a good approximate model for the time until the next phone call arrives. Similar caveats apply to the following examples which yield approximately exponentially distributed variables:

the time until a radioactive particle decays, or the time between clicks of a geiger counter;
the time it takes before your next telephone call;
the time until default (on payment to company debt holders) in reduced-form credit risk modeling.

Exponential variables can also be used to model situations where certain events occur with a constant probability per unit length, such as the distance between mutations on a DNA strand, or between roadkills on a given road.

In queuing theory, the service times of agents in a system (e.g. how long it takes for a bank teller to serve a customer) are often modeled as exponentially distributed variables. (The inter-arrival of customers, for instance, in a system is typically modeled by the Poisson distribution in most management science textbooks.) The length of a process that can be thought of as a sequence of several independent tasks is better modeled by a variable following the Erlang distribution (which is the distribution of the sum of several independent exponentially distributed variables).

Reliability theory and reliability engineering also make extensive use of the exponential distribution. Because of the memoryless property of this distribution, it is well-suited to model the constant hazard rate portion of the bathtub curve used in reliability theory. It is also very convenient because it is so easy to add failure rates in a reliability model. The exponential distribution is however not appropriate to model the overall lifetime of organisms or technical devices, because the "failure rates" here are not constant: more failures occur for very young and for very old systems.

In physics, if you observe a gas at a fixed temperature and pressure in a uniform gravitational field, the heights of the various molecules also follow an approximate exponential distribution. This is a consequence of the entropy property mentioned below.

[Figure: fitted cumulative exponential distribution to annually maximum 1-day rainfalls using CumFreq.[6]]

In hydrology, the exponential distribution is used to analyze extreme values of such variables as monthly and annual maximum values of daily rainfall and river discharge volumes.[7] The blue picture illustrates an example of fitting the exponential distribution to ranked annually maximum one-day rainfalls showing also the 90% confidence belt based on the binomial distribution. The rainfall data are represented by plotting positions as part of the cumulative frequency analysis.


Prediction
Having observed a sample of n data points from an unknown exponential distribution, a common task is to use these samples to make predictions about future data from the same source. A common predictive distribution over future samples is the so-called plug-in distribution, formed by plugging a suitable estimate for the rate parameter λ into the exponential density function. A common choice of estimate is the one provided by the principle of maximum likelihood, and using this yields the predictive density over a future sample x_{n+1}, conditioned on the observed samples x = (x1, ..., xn), given by

p_ML(x_{n+1} | x1, ..., xn) = (1/x̄) exp(−x_{n+1}/x̄).

The Bayesian approach provides a predictive distribution which takes into account the uncertainty of the estimated parameter, although this may depend crucially on the choice of prior.

A predictive distribution free of the issues of choosing priors that arise under the subjective Bayesian approach is

p(x_{n+1} | x1, ..., xn) = n^(n+1) x̄^n / (n x̄ + x_{n+1})^(n+1),

which can be considered as (1) a frequentist confidence distribution, obtained from the distribution of the pivotal quantity x_{n+1}/x̄;[8] (2) a profile predictive likelihood, obtained by eliminating the parameter λ from the joint likelihood of x_{n+1} and λ by maximization;[9] (3) an objective Bayesian predictive posterior distribution, obtained using the non-informative Jeffreys prior 1/λ; and (4) the Conditional Normalized Maximum Likelihood (CNML) predictive distribution, from information theoretic considerations.[10]

The accuracy of a predictive distribution may be measured using the distance or divergence between the true exponential distribution with rate parameter λ0 and the predictive distribution based on the sample x. The Kullback-Leibler divergence is a commonly used, parameterisation-free measure of the difference between two distributions. Letting Δ(λ0 || p) denote the Kullback-Leibler divergence between an exponential with rate parameter λ0 and a predictive distribution p, it can be shown (with the expectation taken with respect to the exponential distribution with rate parameter λ0 ∈ (0, ∞), and with ψ(·) denoting the digamma function) that the CNML predictive distribution is strictly superior to the maximum likelihood plug-in distribution in terms of average Kullback-Leibler divergence for all sample sizes n > 0.

References
[1] Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model" (http://www.wise.xmu.edu.cn/Master/Download/..\..\UploadFiles\paper-masterdownload\2009519932327055475115776.pdf). Journal of Econometrics (Elsevier): 219–230. Retrieved 2011-06-02.
[2] Ross, Sheldon M. (2009). Introduction to Probability and Statistics for Engineers and Scientists (http://books.google.com/books?id=mXP_UEiUo9wC&pg=PA267) (4th ed.). Associated Press. p. 267. ISBN 978-0-12-370483-2.
[3] Guerriero, V. et al. (2010). "Quantifying uncertainties in multi-scale studies of fractured reservoir analogues: Implemented statistical analysis of scan line data from carbonate rocks" (PDF). Journal of Structural Geology (Elsevier). doi:10.1016/j.jsg.2009.04.016.
[4] Knuth, Donald E. (1998). The Art of Computer Programming, volume 2: Seminumerical Algorithms, 3rd edn. Boston: Addison-Wesley. ISBN 0-201-89684-2. See section 3.4.1, p. 133.
[5] Devroye, Luc (1986). Non-Uniform Random Variate Generation (http://luc.devroye.org/rnbookindex.html). New York: Springer-Verlag. ISBN 0-387-96305-7. See chapter IX (http://luc.devroye.org/chapter_nine.pdf), section 2, pp. 392–401.
[6] "Cumfreq, a free computer program for cumulative frequency analysis" (http://www.waterlog.info/cumfreq.htm).
[7] Ritzema, H.P. (ed.) (1994). Frequency and Regression Analysis (http://www.waterlog.info/pdf/freqtxt.pdf). Chapter 6 in: Drainage Principles and Applications, Publication 16, International Institute for Land Reclamation and Improvement (ILRI), Wageningen, The Netherlands. pp. 175–224. ISBN 90-70754-33-9.
[8] Lawless, J.F.; Fredette, M. (2005). "Frequentist prediction intervals and predictive distributions". Biometrika 92 (3): 529–542.
[9] Bjornstad, J.F. (1990). "Predictive Likelihood: A Review". Statistical Science 5 (2): 242–254.
[10] Schmidt, D. F.; Makalic, E. (2009). "Universal Models for the Exponential Distribution" (http://www.emakalic.org/blog/wp-content/uploads/2010/04/SchmidtMakalic09b.pdf). IEEE Transactions on Information Theory 55 (7): 3087–3090. doi:10.1109/TIT.2009.2018331.


External links
Hazewinkel, Michiel, ed. (2001), "Exponential distribution" (http://www.encyclopediaofmath.org/index.php?title=p/e036900), Encyclopedia of Mathematics, Springer, ISBN 978-1-55608-010-4
Online calculator of Exponential Distribution (http://www.stud.feec.vutbr.cz/~xvapen02/vypocty/ex.php?language=english)


Multivariate Continuous Distributions


Multivariate normal distribution
Probability density function

(Figure) Many samples from a multivariate (bivariate) Gaussian distribution centered at (1, 3) with a standard deviation of 3 in roughly the (0.878, 0.478) direction (longer vector) and of 1 in the second direction (shorter vector, orthogonal to the longer vector).

Notation: x ~ N(μ, Σ)
Parameters: μ ∈ R^k (location), Σ ∈ R^{k×k} (covariance, nonnegative-definite matrix)
Support: x ∈ μ + span(Σ) ⊆ R^k
PDF: exists only when Σ is positive-definite (see below)
CDF: no analytic expression
Mean: μ
Mode: μ
Variance: Σ
Entropy: ½ ln((2πe)^k |Σ|)
MGF: exp(μ't + ½ t'Σt)
CF: exp(i μ't − ½ t'Σt)

In probability theory and statistics, the multivariate normal distribution, or multivariate Gaussian distribution, is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One possible definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. However, its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.


Notation and parametrization


The multivariate normal distribution of a k-dimensional random vector x = [X1, X2, ..., Xk]' can be written in the following notation:

    x ~ N(μ, Σ),

or to make it explicitly known that x is k-dimensional,

    x ~ N_k(μ, Σ),

with k-dimensional mean vector

    μ = [E[X1], E[X2], ..., E[Xk]]'

and k × k covariance matrix

    Σ = [Cov(Xi, Xj)],  i, j = 1, 2, ..., k.

Definition
A random vector x = (X1, ..., Xk)' is said to have the multivariate normal distribution if it satisfies the following equivalent conditions.[1]

Every linear combination of its components Y = a1X1 + ... + akXk is normally distributed. That is, for any constant vector a ∈ R^k, the random variable Y = a'x has a univariate normal distribution.

There exists a random ℓ-vector z, whose components are independent standard normal random variables, a k-vector μ, and a k × ℓ matrix A, such that x = Az + μ. Here ℓ is the rank of the covariance matrix Σ = AA'. Especially in the case of full rank, see the section below on Geometric Interpretation.

There is a k-vector μ and a symmetric, nonnegative-definite k × k matrix Σ, such that the characteristic function of x is

    φ_x(u) = exp(i u'μ − ½ u'Σu).
The covariance matrix is allowed to be singular (in which case the corresponding distribution has no density). This case arises frequently in statistics; for example, in the distribution of the vector of residuals in the ordinary least squares regression. Note also that the Xi are in general not independent; they can be seen as the result of applying the matrix A to a collection of independent Gaussian variables z.

Properties
Density function
Non-degenerate case

The multivariate normal distribution is said to be "non-degenerate" when the covariance matrix Σ of the multivariate normal distribution is symmetric and positive definite. In this case the distribution has density

    f_x(x1, ..., xk) = (1 / sqrt((2π)^k |Σ|)) exp(−½ (x − μ)' Σ⁻¹ (x − μ)),

where |Σ| is the determinant of Σ. Note how the equation above reduces to that of the univariate normal distribution if Σ is a 1 × 1 matrix (i.e. a real number).

Bivariate case

In the 2-dimensional nonsingular case (k = rank(Σ) = 2), the probability density function of a vector [X Y]' is

    f(x, y) = (1 / (2π σ_X σ_Y sqrt(1 − ρ²))) exp( −(1 / (2(1 − ρ²))) [ (x − μ_X)²/σ_X² + (y − μ_Y)²/σ_Y² − 2ρ(x − μ_X)(y − μ_Y)/(σ_X σ_Y) ] ),

where ρ is the correlation between X and Y and where σ_X > 0 and σ_Y > 0. In this case,

    μ = [μ_X, μ_Y]',   Σ = [ σ_X²       ρ σ_X σ_Y
                             ρ σ_X σ_Y  σ_Y²      ].

In the bivariate case, we also have a theorem that makes the first equivalent condition for multivariate normality less restrictive: it is sufficient to verify that countably many distinct linear combinations of X and Y are normal in order to conclude that the vector [X Y]' is bivariate normal.[2]

When plotted in the x,y-plane the distribution appears to be squeezed to the line

    y(x) = sgn(ρ) (σ_Y/σ_X) (x − μ_X) + μ_Y

as the correlation parameter ρ increases in absolute value. This is because the above expression is the best linear unbiased prediction of Y given a value of X.[3]

Degenerate case

If the covariance matrix Σ is not full rank, then the multivariate normal distribution is degenerate and does not have a density. More precisely, it does not have a density with respect to k-dimensional Lebesgue measure (which is the usual measure assumed in calculus-level probability courses). Only random vectors whose distributions are absolutely continuous with respect to a measure are said to have densities (with respect to that measure). To talk about densities but avoid dealing with measure-theoretic complications it can be simpler to restrict attention to a subset of the coordinates of x such that the covariance matrix for this subset is positive definite; then the other coordinates may be thought of as an affine function of the selected coordinates.

To talk about densities meaningfully in the singular case, then, we must select a different base measure. Using the disintegration theorem we can define a restriction of Lebesgue measure to the rank(Σ)-dimensional affine subspace of R^k where the Gaussian distribution is supported, i.e. μ + span(Σ). With respect to this probability measure the distribution has density

    f(x) = (det*(2πΣ))^(−1/2) exp(−½ (x − μ)' Σ⁺ (x − μ))   for x in μ + span(Σ),

where Σ⁺ is the generalized inverse and det* is the pseudo-determinant.
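As a numerical companion to the non-degenerate density above, the log-density is usually evaluated through a Cholesky factor of Σ rather than an explicit matrix inverse. The following is an illustrative sketch added for this reference (not part of the original article); it assumes NumPy and uses hypothetical parameter values.

    import numpy as np

    def mvn_logpdf(x, mu, sigma):
        """Log of the multivariate normal density, via a Cholesky factor of sigma."""
        k = len(mu)
        L = np.linalg.cholesky(sigma)          # sigma = L @ L.T; requires positive definiteness
        z = np.linalg.solve(L, x - mu)         # whitened residual, so z @ z is the quadratic form
        log_det = 2.0 * np.sum(np.log(np.diag(L)))
        return -0.5 * (k * np.log(2 * np.pi) + log_det + z @ z)

    mu = np.array([1.0, 3.0])
    sigma = np.array([[4.0, 1.5],
                      [1.5, 2.0]])
    print(mvn_logpdf(np.array([0.5, 2.0]), mu, sigma))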

Higher moments
The kth-order moments of x are defined by

    μ_{r1, ..., rN}(x) = E[ X1^{r1} X2^{r2} ... XN^{rN} ],

where r1 + r2 + ... + rN = k. The central kth-order moments are given as follows.

(a) If k is odd, μ_{1, ..., N}(x − μ) = 0.

(b) If k is even, with k = 2λ, then

    μ_{1, ..., 2λ}(x − μ) = Σ ( σ_{ij} σ_{kl} ... σ_{xz} ),

where the sum is taken over all allocations of the set {1, ..., 2λ} into λ (unordered) pairs. That is, for a sixth-order (k = 2λ = 6) central moment, one sums the products of λ = 3 covariances (the μ's have been dropped in the interests of parsimony):

    E[x1 x2 x3 x4 x5 x6] = σ12 σ34 σ56 + σ12 σ35 σ46 + σ12 σ36 σ45 + σ13 σ24 σ56 + ...   (15 terms in total).

This yields (2λ)! / (2^λ λ!) terms in the sum (15 in the above case), each being the product of λ (in this case 3) covariances. For fourth-order moments (four variables) there are three terms. For sixth-order moments there are 3 × 5 = 15 terms, and for eighth-order moments there are 3 × 5 × 7 = 105 terms.

The covariances are then determined by replacing the terms of the list [1, ..., 2λ] by the corresponding terms of the list consisting of r1 ones, then r2 twos, etc. To illustrate this, examine the following 4th-order central moment case:

    E[x_i^4] = 3 σ_ii²
    E[x_i³ x_j] = 3 σ_ii σ_ij
    E[x_i² x_j²] = σ_ii σ_jj + 2 σ_ij²
    E[x_i² x_j x_k] = σ_ii σ_jk + 2 σ_ij σ_ik
    E[x_i x_j x_k x_n] = σ_ij σ_kn + σ_ik σ_jn + σ_in σ_jk,

where σ_ij is the covariance of x_i and x_j. The idea with the above method is that one first finds the general case for a kth moment with k different x variables, E[x_i x_j x_k x_n], and then simplifies accordingly. For example, for E[x_i² x_k x_n] one simply lets x_i = x_j and uses the fact that σ_ii = σ_i².
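A quick Monte Carlo check of the last fourth-order identity above can be illuminating. The following sketch is added here for illustration (not part of the original article); it assumes NumPy and uses a hypothetical covariance matrix.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])
    x = rng.multivariate_normal(mean=np.zeros(3), cov=sigma, size=500_000)

    i, j, k, n = 0, 1, 2, 0                      # indices may repeat
    empirical = np.mean(x[:, i] * x[:, j] * x[:, k] * x[:, n])
    theory = (sigma[i, j] * sigma[k, n] + sigma[i, k] * sigma[j, n]
              + sigma[i, n] * sigma[j, k])
    print(empirical, theory)                     # the two values should be close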

Likelihood function
If the mean and variance matrix are unknown, a suitable log likelihood function for a single observation x would be

    ln L(μ, Σ) = −½ [ ln |Σ| + (x − μ)' Σ⁻¹ (x − μ) + k ln(2π) ],

where x is a vector of real numbers. The complex case, where z is a vector of complex numbers, would be

    ln L(μ, Σ) = −ln |Σ| − (z − μ)† Σ⁻¹ (z − μ) − k ln(π),

where † denotes the conjugate transpose. A similar notation is used for multiple linear regression.[4]

Entropy
The differential entropy of the multivariate normal distribution is[5]

    h(x) = ½ ln[ (2πe)^k |Σ| ],

where |Σ| is the determinant of the covariance matrix Σ.


Kullback–Leibler divergence
The Kullback–Leibler divergence from N(μ0, Σ0) to N(μ1, Σ1), for non-singular matrices Σ0 and Σ1, is:[6]

    D_KL(N0 || N1) = ½ ( tr(Σ1⁻¹ Σ0) + (μ1 − μ0)' Σ1⁻¹ (μ1 − μ0) − k + ln(|Σ1| / |Σ0|) ).

The logarithm must be taken to base e since the two terms following the logarithm are themselves base-e logarithms of expressions that are either factors of the density function or otherwise arise naturally. The equation therefore gives a result measured in nats. Dividing the entire expression above by ln 2 yields the divergence in bits.
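The divergence formula above transcribes directly into code. This is a sketch added here for illustration (not part of the original article); it assumes NumPy and uses made-up example parameters.

    import numpy as np

    def kl_mvn(mu0, sigma0, mu1, sigma1):
        """Kullback-Leibler divergence D( N(mu0, sigma0) || N(mu1, sigma1) ), in nats."""
        k = len(mu0)
        inv1 = np.linalg.inv(sigma1)
        diff = mu1 - mu0
        term_trace = np.trace(inv1 @ sigma0)
        term_quad = diff @ inv1 @ diff
        term_logdet = np.linalg.slogdet(sigma1)[1] - np.linalg.slogdet(sigma0)[1]
        return 0.5 * (term_trace + term_quad - k + term_logdet)

    mu0, sigma0 = np.zeros(2), np.eye(2)
    mu1, sigma1 = np.array([1.0, 0.0]), np.array([[2.0, 0.3], [0.3, 1.0]])
    print(kl_mvn(mu0, sigma0, mu1, sigma1))      # nats; divide by log(2) for bits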

Cumulative distribution function


The cumulative distribution function (cdf) F(x0) of a random vector x is defined as the probability that all components of x are less than or equal to the corresponding values in the vector x0. Though there is no closed form for F(x), there are a number of algorithms that estimate it numerically.

Tolerance region
The equivalent of univariate normal tolerance intervals in the multivariate case is a tolerance region. Such a region consists of those vectors x satisfying

    (x − μ)' Σ⁻¹ (x − μ) ≤ χ²_k(p).

Here x is a k-dimensional vector, μ is the known k-dimensional mean vector, Σ is the known covariance matrix and χ²_k(p) is the quantile function for probability p of the chi-squared distribution with k degrees of freedom. When k = 2 the expression defines the interior of an ellipse and the chi-squared distribution simplifies to an exponential distribution with mean equal to two.
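The membership test implied by the inequality above is a one-line Mahalanobis-distance check. The following sketch is added here for illustration (not part of the original article); it assumes NumPy and SciPy and uses hypothetical parameter values.

    import numpy as np
    from scipy.stats import chi2

    def in_tolerance_region(x, mu, sigma, p=0.95):
        """True if x lies inside the probability-p tolerance region of N(mu, sigma)."""
        d2 = (x - mu) @ np.linalg.inv(sigma) @ (x - mu)   # squared Mahalanobis distance
        return d2 <= chi2.ppf(p, df=len(mu))

    mu = np.array([1.0, 3.0])
    sigma = np.array([[4.0, 1.5],
                      [1.5, 2.0]])
    print(in_tolerance_region(np.array([2.0, 4.0]), mu, sigma))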

Joint normality
Normally distributed and independent
If X and Y are normally distributed and independent, this implies they are "jointly normally distributed", i.e., the pair (X, Y) must have a multivariate normal distribution. However, a pair of jointly normally distributed variables need not be independent.

Two normally distributed random variables need not be jointly bivariate normal
The fact that two random variables X and Y both have a normal distribution does not imply that the pair (X, Y) has a joint normal distribution. A simple example is one in which X has a normal distribution with expected value 0 and variance 1, and Y = X if |X| > c and Y = −X if |X| < c, where c > 0. There are similar counterexamples for more than two random variables.


Correlations and independence


In general, random variables may be uncorrelated but highly dependent. But if a random vector has a multivariate normal distribution then any two or more of its components that are uncorrelated are independent. This implies that any two or more of its components that are pairwise independent are independent. But it is not true that two random variables that are (separately, marginally) normally distributed and uncorrelated are independent. Two random variables that are normally distributed may fail to be jointly normally distributed, i.e., the vector whose components they are may fail to have a multivariate normal distribution. For an example of two normally distributed random variables that are uncorrelated but not independent, see normally distributed and uncorrelated does not imply independent.

Conditional distributions
If μ and Σ are partitioned as follows

    μ = [μ1; μ2]   with sizes [q × 1; (k − q) × 1],
    Σ = [Σ11 Σ12; Σ21 Σ22]   with sizes [q × q, q × (k − q); (k − q) × q, (k − q) × (k − q)],

then the distribution of x1 conditional on x2 = a is multivariate normal, (x1 | x2 = a) ~ N(μ̄, Σ̄), where

    μ̄ = μ1 + Σ12 Σ22⁻¹ (a − μ2)

and covariance matrix

    Σ̄ = Σ11 − Σ12 Σ22⁻¹ Σ21.[7]

This matrix Σ̄ is the Schur complement of Σ22 in Σ. This means that to calculate the conditional covariance matrix, one inverts the overall covariance matrix, drops the rows and columns corresponding to the variables being conditioned upon, and then inverts back to get the conditional covariance matrix. Here Σ22⁻¹ is the generalized inverse of Σ22.

Note that knowing that x2 = a alters the variance, though the new variance does not depend on the specific value of a; perhaps more surprisingly, the mean is shifted by Σ12 Σ22⁻¹ (a − μ2); compare this with the situation of not knowing the value of a, in which case x1 would have distribution N_q(μ1, Σ11).

An interesting fact derived in order to prove this result is that the random vectors x2 and y1 = x1 − Σ12 Σ22⁻¹ x2 are independent.

The matrix Σ12 Σ22⁻¹ is known as the matrix of regression coefficients.

In the bivariate case where x is partitioned into X1 and X2, the conditional distribution of X1 given X2 is

    X1 | X2 = x2  ~  N( μ1 + (σ1/σ2) ρ (x2 − μ2), (1 − ρ²) σ1² ),

where ρ is the correlation coefficient between X1 and X2.
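The conditioning formulas above translate directly into code. The following sketch is added here as an illustration (not part of the original article); it assumes NumPy, and the numerical values are hypothetical.

    import numpy as np

    def condition_mvn(mu, sigma, idx1, idx2, a):
        """Parameters of x[idx1] | x[idx2] = a for x ~ N(mu, sigma)."""
        mu1, mu2 = mu[idx1], mu[idx2]
        s11 = sigma[np.ix_(idx1, idx1)]
        s12 = sigma[np.ix_(idx1, idx2)]
        s22 = sigma[np.ix_(idx2, idx2)]
        w = s12 @ np.linalg.inv(s22)             # matrix of regression coefficients
        cond_mean = mu1 + w @ (a - mu2)
        cond_cov = s11 - w @ s12.T               # Schur complement of s22 in sigma
        return cond_mean, cond_cov

    mu = np.array([1.0, 2.0, 3.0])
    sigma = np.array([[2.0, 0.6, 0.3],
                      [0.6, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])
    print(condition_mvn(mu, sigma, idx1=[0], idx2=[1, 2], a=np.array([1.5, 2.5])))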


Bivariate conditional expectation


In the case

    [X1; X2] ~ N( [0; 0], [1 ρ; ρ 1] ),

the following result holds:

    E(X1 | X2 > z) = ρ φ(z) / Φ(−z),

where φ and Φ are the standard normal density and distribution functions, and the final ratio here is called the inverse Mills ratio.

Marginal distributions
To obtain the marginal distribution over a subset of multivariate normal random variables, one only needs to drop the irrelevant variables (the variables that one wants to marginalize out) from the mean vector and the covariance matrix. The proof for this follows from the definitions of multivariate normal distributions and linear algebra.[8] Example: let x = [X1, X2, X3]' be multivariate normal random variables with mean vector μ = [μ1, μ2, μ3]' = [1, 2, 3]' and covariance matrix Σ (standard parametrization for multivariate normal distributions). Then the joint distribution of x' = [X1, X3]' is multivariate normal with mean vector μ' = [μ1, μ3]' = [1, 3]' and covariance matrix Σ' obtained by deleting the second row and second column of Σ.

Affine transformation
If y = c + Bx is an affine transformation of x ~ N(μ, Σ), where c is an M × 1 vector of constants and B is a constant M × N matrix, then y has a multivariate normal distribution with expected value c + Bμ and variance BΣB', i.e., y ~ N(c + Bμ, BΣB'). In particular, any subset of the x_i has a marginal distribution that is also multivariate normal. To see this, consider the following example: to extract the subset (x1, x2, x4)', use

    B = [ 1 0 0 0 ... 0
          0 1 0 0 ... 0
          0 0 0 1 ... 0 ],

which extracts the desired elements directly.

Another corollary is that the distribution of Z = b · x, where b is a constant vector of the same length as x and the dot indicates a vector product, is univariate Gaussian with Z ~ N(b · μ, b'Σb). This result follows by using

    B = [ b1 b2 ... bn ] = b'

and considering only the first component of the product (the first row of B is the vector b). Observe how the positive-definiteness of Σ implies that the variance of the dot product must be positive.

An affine transformation of x such as 2x is not the same as the sum of two independent realisations of x.
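The affine-transformation rule above amounts to two matrix products. The following sketch is added here for illustration (not part of the original article); it assumes NumPy and uses hypothetical values.

    import numpy as np

    mu = np.array([1.0, 2.0, 3.0, 4.0])
    sigma = np.diag([1.0, 2.0, 3.0, 4.0]) + 0.5   # a hypothetical covariance matrix

    # Selection matrix extracting (x1, x2, x4); y = c + B x
    B = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    c = np.zeros(3)

    mean_y = c + B @ mu          # c + B mu
    cov_y = B @ sigma @ B.T      # B Sigma B'
    print(mean_y)
    print(cov_y)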


Geometric interpretation
The equidensity contours of a non-singular multivariate normal distribution are ellipsoids (i.e. linear transformations of hyperspheres) centered at the mean.[9] The directions of the principal axes of the ellipsoids are given by the eigenvectors of the covariance matrix Σ. The squared relative lengths of the principal axes are given by the corresponding eigenvalues.

If Σ = UΛU' = UΛ^(1/2)(UΛ^(1/2))' is an eigendecomposition where the columns of U are unit eigenvectors and Λ is a diagonal matrix of the eigenvalues, then we have

    x ~ N(μ, Σ)   is equivalent to   x ~ μ + UΛ^(1/2) N(0, I)   and to   x ~ μ + U N(0, Λ).

Moreover, U can be chosen to be a rotation matrix, as inverting an axis does not have any effect on N(0, Λ), but inverting a column changes the sign of U's determinant. The distribution N(μ, Σ) is in effect N(0, I) scaled by Λ^(1/2), rotated by U and translated by μ.

Conversely, any choice of μ, full-rank matrix U, and positive diagonal entries Λ_i yields a non-singular multivariate normal distribution. If any Λ_i is zero and U is square, the resulting covariance matrix UΛU' is singular. Geometrically this means that every contour ellipsoid is infinitely thin and has zero volume in n-dimensional space, as at least one of the principal axes has length of zero.

Estimation of parameters
The derivation of the maximum-likelihood estimator of the covariance matrix of a multivariate normal distribution is perhaps surprisingly subtle and elegant. See estimation of covariance matrices. In short, the probability density function (pdf) of a k-dimensional multivariate normal is

    f(x) = (2π)^(−k/2) |Σ|^(−1/2) exp(−½ (x − μ)' Σ⁻¹ (x − μ)),

and the ML estimator of the covariance matrix from a sample of n observations x1, ..., xn is

    Σ̂ = (1/n) Σ_i (x_i − x̄)(x_i − x̄)',

which is simply the sample covariance matrix. This is a biased estimator whose expectation is

    E[Σ̂] = ((n − 1)/n) Σ.

An unbiased sample covariance is

    Σ̂ = (1/(n − 1)) Σ_i (x_i − x̄)(x_i − x̄)'.
The Fisher information matrix for estimating the parameters of a multivariate normal distribution has a closed form expression. This can be used, for example, to compute the CramrRao bound for parameter estimation in this setting. See Fisher information for more details.
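The two estimators above differ only in their normalizing constant. The following sketch is added here for illustration (not part of the original article); it assumes NumPy and uses simulated data.

    import numpy as np

    rng = np.random.default_rng(1)
    true_mu = np.array([1.0, 3.0])
    true_sigma = np.array([[4.0, 1.5],
                           [1.5, 2.0]])
    x = rng.multivariate_normal(true_mu, true_sigma, size=500)   # n x k sample

    xbar = x.mean(axis=0)
    centered = x - xbar
    n = x.shape[0]
    sigma_ml = centered.T @ centered / n           # biased ML estimator
    sigma_unbiased = centered.T @ centered / (n - 1)
    print(sigma_ml)
    print(sigma_unbiased)                          # equals np.cov(x, rowvar=False)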


Bayesian inference
In Bayesian statistics, the conjugate prior of the mean vector is another multivariate normal distribution, and the conjugate prior of the covariance matrix is an inverse-Wishart distribution W⁻¹. Suppose then that n observations have been made,

    X = {x1, ..., xn} ~ N(μ, Σ),

and that a conjugate prior has been assigned, where

    p(μ, Σ) = p(μ | Σ) p(Σ),

where

    p(μ | Σ) ~ N(μ0, m⁻¹Σ)

and

    p(Σ) ~ W⁻¹(Ψ, n0).

Then,

    p(μ | Σ, X) ~ N( (n x̄ + m μ0)/(n + m), (n + m)⁻¹Σ ),
    p(Σ | X) ~ W⁻¹( Ψ + n S + (n m / (n + m)) (x̄ − μ0)(x̄ − μ0)', n + n0 ),

where

    x̄ = (1/n) Σ_i x_i,
    S = (1/n) Σ_i (x_i − x̄)(x_i − x̄)'.
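A sketch of this conjugate update, mirroring the posterior formulas as reconstructed above, is added here for illustration (not part of the original article); it assumes NumPy, and the hyperparameter names and values are this sketch's own.

    import numpy as np

    def niw_posterior(x, mu0, m, psi, n0):
        """Posterior hyperparameters for the normal-inverse-Wishart conjugate prior."""
        n, k = x.shape
        xbar = x.mean(axis=0)
        centered = x - xbar
        scatter = centered.T @ centered                       # equals n * S
        mu_post = (n * xbar + m * mu0) / (n + m)
        m_post = n + m                                        # posterior prior-strength
        psi_post = psi + scatter + (n * m / (n + m)) * np.outer(xbar - mu0, xbar - mu0)
        n0_post = n0 + n
        return mu_post, m_post, psi_post, n0_post

    rng = np.random.default_rng(2)
    data = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=50)
    print(niw_posterior(data, mu0=np.zeros(2), m=1.0, psi=np.eye(2), n0=4.0))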

Multivariate normality tests


Multivariate normality tests check a given set of data for similarity to the multivariate normal distribution. The null hypothesis is that the data set is similar to the normal distribution, therefore a sufficiently small p-value indicates non-normal data. Multivariate normality tests include the Cox–Small test[10] and Smith and Jain's adaptation[11] of the Friedman–Rafsky test.[12]

Mardia's test[13] is based on multivariate extensions of skewness and kurtosis measures. For a sample {x1, ..., xn} of k-dimensional vectors we compute

    Ŝ = (1/n) Σ_j (x_j − x̄)(x_j − x̄)'
    A = (1/(6n)) Σ_i Σ_j [ (x_i − x̄)' Ŝ⁻¹ (x_j − x̄) ]³
    B = sqrt( n / (8 k (k + 2)) ) { (1/n) Σ_i [ (x_i − x̄)' Ŝ⁻¹ (x_i − x̄) ]² − k(k + 2) }.

Under the null hypothesis of multivariate normality, the statistic A will have approximately a chi-squared distribution with k(k + 1)(k + 2)/6 degrees of freedom, and B will be approximately standard normal N(0, 1). Mardia's kurtosis statistic is skewed and converges very slowly to the limiting normal distribution. For medium size samples the parameters of the asymptotic distribution of the kurtosis statistic are modified;[14] for small sample tests empirical critical values are used. Tables of critical values for both statistics are given by Rencher[15] for k = 2, 3, 4.

Mardia's tests are affine invariant but not consistent. For example, the multivariate skewness test is not consistent against symmetric non-normal alternatives.

The BHEP test[16] computes the norm of the difference between the empirical characteristic function and the theoretical characteristic function of the normal distribution. Calculation of the norm is performed in the L²(μ_β) space of square-integrable functions with respect to the Gaussian weighting function μ_β(t) = (2πβ²)^(−k/2) exp(−|t|²/(2β²)). The test statistic is

    T_β = ∫ | (1/n) Σ_j exp(i t' Ŝ^(−1/2)(x_j − x̄)) − exp(−|t|²/2) |² μ_β(t) dt.

The limiting distribution of this test statistic is a weighted sum of chi-squared random variables,[17] however in practice it is more convenient to compute the sample quantiles using the Monte-Carlo simulations. A detailed survey of these and other test procedures is available.[18]
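A sketch computing Mardia's two statistics as defined above is added here for illustration (not part of the original article); it assumes NumPy, uses simulated data, and does not include the medium- or small-sample corrections mentioned in the text.

    import numpy as np

    def mardia_statistics(x):
        """Mardia's skewness statistic A and kurtosis statistic B for an (n, k) sample."""
        n, k = x.shape
        centered = x - x.mean(axis=0)
        s_inv = np.linalg.inv(centered.T @ centered / n)   # inverse of the ML covariance
        g = centered @ s_inv @ centered.T                   # g[i, j] = (x_i - xbar)' S^-1 (x_j - xbar)
        a = np.sum(g ** 3) / (6.0 * n)
        b = np.sqrt(n / (8.0 * k * (k + 2))) * (np.mean(np.diag(g) ** 2) - k * (k + 2))
        return a, b

    rng = np.random.default_rng(3)
    x = rng.multivariate_normal(np.zeros(3), np.eye(3), size=400)
    a, b = mardia_statistics(x)
    print(a, b)   # A approx chi-squared with k(k+1)(k+2)/6 d.f.; B approx N(0, 1)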

Drawing values from the distribution


A widely used method for drawing a random vector x from the N-dimensional multivariate normal distribution with mean vector μ and covariance matrix Σ works as follows:
1. Find any real matrix A such that A A' = Σ. When Σ is positive-definite, the Cholesky decomposition is typically used, and the extended form of this decomposition can be used in the more general nonnegative-definite case: in both cases a suitable matrix A is obtained. An alternative is to use the matrix A = UΛ^(1/2) obtained from a spectral decomposition Σ = UΛU' of Σ. The former approach is more computationally straightforward but the matrices A change for different orderings of the elements of the random vector, while the latter approach gives matrices that are related by simple re-orderings. In theory both approaches give equally good ways of determining a suitable matrix A, but there are differences in computation time.
2. Let z = (z1, ..., zN)' be a vector whose components are N independent standard normal variates (which can be generated, for example, by using the Box–Muller transform).
3. Let x be μ + Az. This has the desired distribution due to the affine transformation property.
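The three steps above correspond to a few lines of code in the positive-definite case. The following sketch is added here for illustration (not part of the original article); it assumes NumPy and uses hypothetical parameters.

    import numpy as np

    rng = np.random.default_rng(4)
    mu = np.array([1.0, 3.0])
    sigma = np.array([[4.0, 1.5],
                      [1.5, 2.0]])

    A = np.linalg.cholesky(sigma)                 # step 1: A @ A.T == sigma
    z = rng.standard_normal(size=(1000, 2))       # step 2: independent standard normal variates
    x = mu + z @ A.T                              # step 3: x = mu + A z, vectorized over the sample
    print(x.mean(axis=0), np.cov(x, rowvar=False))   # should approximate mu and sigma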

References
[1] Gut, Allan (2009). An Intermediate Course in Probability. Springer. ISBN 9781441901613 (Chapter 5).
[2] Hamedani, G. G.; Tata, M. N. (1975). "On the determination of the bivariate normal distribution from distributions of linear combinations of the variables". The American Mathematical Monthly 82 (9): 913–915. doi:10.2307/2318494.
[3] Wyatt, John. "Linear least mean-squared error estimation" (http://web.mit.edu/6.041/www/LECTURE/lec22.pdf). Lecture notes, course on applied probability. Retrieved 23 January 2012.
[4] Tong, T. (2010). Multiple Linear Regression: MLE and Its Distributional Results (http://amath.colorado.edu/courses/7400/2010Spr/lecture9.pdf), Lecture Notes.
[5] Gokhale, DV; NA Ahmed, BC Res, NJ Piscataway (May 1989). "Entropy Expressions and Their Estimators for Multivariate Distributions". Information Theory, IEEE Transactions on 35 (3): 688–692. doi:10.1109/18.30996.
[6] Penny & Roberts, PARG-00-12 (2000) (http://www.allisons.org/ll/MML/KL/Normal). pp. 18.
[7] Eaton, Morris L. (1983). Multivariate Statistics: a Vector Space Approach. John Wiley and Sons. pp. 116–117. ISBN 0-471-02776-6.
[8] The formal proof for marginal distribution is shown here: http://fourier.eng.hmc.edu/e161/lectures/gaussianprocess/node7.html
[9] Hansen, Nikolaus. "The CMA Evolution Strategy: A Tutorial" (http://www.lri.fr/~hansen/cmatutorial.pdf) (PDF).
[10] Cox, D. R.; Small, N. J. H. (August 1978). "Testing multivariate normality". Biometrika 65 (2): 263–272. doi:10.1093/biomet/65.2.263.
[11] Smith, Stephen P.; Jain, Anil K. (September 1988). "A test to determine the multivariate normality of a dataset". IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (5): 757–761. doi:10.1109/34.6789.
[12] Friedman, J. H.; Rafsky, L. C. (1979). "Multivariate generalizations of the Wald–Wolfowitz and Smirnov two sample tests". Annals of Statistics 7: 697–717.
[13] Mardia, K. V. (1970). "Measures of multivariate skewness and kurtosis with applications". Biometrika 57 (3): 519–530. doi:10.1093/biomet/57.3.519.
[14] Rencher (1995), pages 112–113.
[15] Rencher (1995), pages 493–495.
[16] Epps, Lawrence B.; Pulley, Lawrence B. (1983). "A test for normality based on the empirical characteristic function". Biometrika 70 (3): 723–726. doi:10.1093/biomet/70.3.723.
[17] Baringhaus, L.; Henze, N. (1988). "A consistent test for multivariate normality based on the empirical characteristic function". Metrika 35 (1): 339–348. doi:10.1007/BF02613322.
[18] Henze, Norbert (2002). "Invariant tests for multivariate normality: a critical review". Statistical Papers 43 (4): 467–506. doi:10.1007/s00362-002-0119-6.

Literature
Rencher, A.C. (1995). Methods of Multivariate Analysis. New York: Wiley.


Wishart distribution
Wishart
Notation: X ~ W_p(V, n)
Parameters: n > p − 1 degrees of freedom (real); V > 0 scale matrix (p × p, positive definite)
Support: X, a p × p positive-definite matrix
PDF: |X|^((n − p − 1)/2) exp(−tr(V⁻¹X)/2) / ( 2^(np/2) |V|^(n/2) Γ_p(n/2) ), where Γ_p is the multivariate gamma function and tr is the trace function
Mean: nV
Mode: (n − p − 1)V for n ≥ p + 1
Variance: Var(X_ij) = n (v_ij² + v_ii v_jj)
Entropy: see below
CF: see below

In statistics, the Wishart distribution is a generalization to multiple dimensions of the chi-squared distribution, or, in the case of non-integer degrees of freedom, of the gamma distribution. It is named in honor of John Wishart, who first formulated the distribution in 1928.[1] It is any of a family of probability distributions defined over symmetric, nonnegative-definite matrix-valued random variables (random matrices). These distributions are of great importance in the estimation of covariance matrices in multivariate statistics. In Bayesian statistics, the Wishart distribution is the conjugate prior of the inverse covariance-matrix of a multivariate-normal random-vector.

Definition
Suppose X is an n × p matrix, each row of which is independently drawn from a p-variate normal distribution with zero mean:

    X_(i) ~ N_p(0, V).

Then the Wishart distribution is the probability distribution of the p × p random matrix

    S = X' X,

known as the scatter matrix. One indicates that S has that probability distribution by writing

    S ~ W_p(V, n).

The positive integer n is the number of degrees of freedom. Sometimes this is written W(V, p, n). For n ≥ p the matrix S is invertible with probability 1 if V is invertible. If p = 1 and V = 1 then this distribution is a chi-squared distribution with n degrees of freedom.


Occurrence
The Wishart distribution arises as the distribution of the sample covariance matrix for a sample from a multivariate normal distribution. It occurs frequently in likelihood-ratio tests in multivariate statistical analysis. It also arises in the spectral theory of random matrices and in multidimensional Bayesian analysis.

Probability density function


The Wishart distribution can be characterized by its probability density function, as follows. Let X be a p × p symmetric matrix of random variables that is positive definite. Let V be a (fixed) positive definite matrix of size p × p. Then, if n ≥ p, X has a Wishart distribution with n degrees of freedom if it has a probability density function given by

    f(X) = ( 1 / ( 2^(np/2) |V|^(n/2) Γ_p(n/2) ) ) |X|^((n − p − 1)/2) exp( −½ tr(V⁻¹X) ),

where Γ_p(·) is the multivariate gamma function defined as

    Γ_p(n/2) = π^(p(p − 1)/4) Π_{j=1}^p Γ( (n + 1 − j)/2 ).

In fact the above definition can be extended to any real n > p − 1. If n ≤ p − 1, then the Wishart no longer has a density; instead it represents a singular distribution.[2]

Properties
Log-expectation
Note the following formula:[3]

    E[ ln |X| ] = Σ_{i=1}^p ψ( (n + 1 − i)/2 ) + p ln(2) + ln |V|,

where ψ(·) is the digamma function (the derivative of the log of the gamma function).

This plays a role in variational Bayes derivations for Bayes networks involving the Wishart distribution.

Entropy
The information entropy of the distribution has the following formula:[3]

    H[X] = −ln B(V, n) − ((n − p − 1)/2) E[ ln |X| ] + n p / 2,

where B(V, n) is the normalizing constant of the distribution:

    B(V, n) = 1 / ( |V|^(n/2) 2^(np/2) Γ_p(n/2) ).

This can be expanded as follows:

    H[X] = ((p + 1)/2) ln |V| + (p(p + 1)/2) ln 2 + ln Γ_p(n/2) − ((n − p − 1)/2) Σ_{i=1}^p ψ( (n + 1 − i)/2 ) + n p / 2.

Characteristic function
The characteristic function of the Wishart distribution is

    Θ ↦ | I − 2iΘV |^(−n/2).

In other words,

    Θ ↦ E[ exp( i tr(XΘ) ) ] = | I − 2iΘV |^(−n/2),

where E[·] denotes expectation. (Here Θ and I are matrices the same size as V (I is the identity matrix); and i is the square root of −1.)

Theorem
If a p × p random matrix X has a Wishart distribution with m degrees of freedom and variance matrix V (write X ~ W_p(V, m)), and C is a q × p matrix of rank q, then[4]

    C X C' ~ W_q( C V C', m ).

Corollary 1
If λ is a nonzero p × 1 constant vector, then[4]

    λ' X λ ~ σ_λ² χ²_m.

In this case, χ²_m is the chi-squared distribution and σ_λ² = λ' V λ (note that σ_λ² is a constant; it is positive because V is positive definite).


Corollary 2
Consider the case where λ = (0, ..., 0, 1, 0, ..., 0)' (that is, the jth element is one and all others zero). Then corollary 1 above shows that

    w_jj ~ σ_jj χ²_m

gives the marginal distribution of each of the elements on the matrix's diagonal. Noted statistician George Seber points out that the Wishart distribution is not called the "multivariate chi-squared distribution" because the marginal distribution of the off-diagonal elements is not chi-squared. Seber prefers to reserve the term multivariate for the case when all univariate marginals belong to the same family.

Estimator of the multivariate normal distribution


The Wishart distribution is the sampling distribution of the maximum-likelihood estimator (MLE) of the covariance matrix of a multivariate normal distribution with mean zero. A derivation of the MLE uses the spectral theorem.

Bartlett decomposition
The Bartlett decomposition of a matrix X from a p-variate Wishart distribution with scale matrix V and n degrees of freedom is the factorization:

    X = L A A' L',

where L is the Cholesky decomposition of V, and A is the lower-triangular matrix

    A = [ c1    0     ...  0
          n21   c2    ...  0
          ...   ...   ...  ...
          n_p1  n_p2  ...  c_p ],

where c_i² ~ χ²_{n−i+1} and n_ij ~ N(0, 1) independently.[5] This provides a useful method for obtaining random samples from a Wishart distribution.
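A sketch of Wishart sampling via the Bartlett decomposition described above is added here for illustration (not part of the original article); it assumes NumPy, and the scale matrix is hypothetical.

    import numpy as np

    def sample_wishart(v, n, rng):
        """Draw one matrix from W_p(V, n) using the Bartlett decomposition (requires n >= p)."""
        p = v.shape[0]
        l = np.linalg.cholesky(v)                         # V = L L'
        a = np.zeros((p, p))
        # Diagonal: c_i^2 ~ chi-squared with n - i + 1 degrees of freedom (i = 1..p)
        a[np.diag_indices(p)] = np.sqrt(rng.chisquare(df=n - np.arange(p)))
        # Strictly lower triangle: independent standard normals
        a[np.tril_indices(p, k=-1)] = rng.standard_normal(p * (p - 1) // 2)
        la = l @ a
        return la @ la.T                                  # X = L A A' L'

    rng = np.random.default_rng(5)
    v = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
    print(sample_wishart(v, n=5, rng=rng))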

The possible range of the shape parameter


It can be shown[6] that the Wishart distribution can be defined if and only if the shape parameter n belongs to the set

    Λ_p := {0, 1, ..., p − 1} ∪ (p − 1, ∞).

This set is named after Gindikin, who introduced it[7] in the seventies in the context of gamma distributions on homogeneous cones. However, for the new parameters in the discrete spectrum of the Gindikin ensemble, namely

    Λ*_p := {0, 1, ..., p − 1},

the corresponding Wishart distribution has no Lebesgue density.


Relationships to other distributions


The Wishart distribution is related to the inverse-Wishart distribution, denoted by W_p⁻¹, as follows: If X ~ W_p(V, n) and if we do the change of variables C = X⁻¹, then C ~ W_p⁻¹(V⁻¹, n). This relationship may be derived by considering the absolute value of the Jacobian determinant of this change of variables; see for example equation (15.15) in.[8]

In Bayesian statistics, the Wishart distribution is a conjugate prior for the precision parameter of the multivariate normal distribution, when the mean parameter is known.[9]

A generalization is the multivariate gamma distribution.

A different type of generalization is the normal-Wishart distribution, essentially the product of a multivariate normal distribution with a Wishart distribution.

References
[1] Wishart, J. (1928). "The generalised product moment distribution in samples from a normal multivariate population". Biometrika 20A (1–2): 32–52. doi:10.1093/biomet/20A.1-2.32. JFM 54.0565.02.
[2] Uhlig, Harald (1994). "On singular Wishart and singular multivariate beta distributions". The Annals of Statistics: 395–405. Project Euclid: http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1176325375
[3] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. p. 693.
[4] Rao, C. R. (1965). Linear Statistical Inference and Its Applications. Wiley. p. 535.
[5] Smith, W. B.; Hocking, R. R. (1972). "Algorithm AS 53: Wishart Variate Generator". Journal of the Royal Statistical Society. Series C (Applied Statistics) 21 (3): 341–345. JSTOR 2346290.
[6] Peddada, Shyamal Das; Richards, Donald St. P. (1991). "Proof of a Conjecture of M. L. Eaton on the Characteristic Function of the Wishart Distribution". Annals of Probability 19 (2): 868–874. doi:10.1214/aop/1176990455.
[7] Gindikin, S. G. (1975). "Invariant generalized functions in homogeneous domains". Funct. Anal. Appl. 9 (1): 50–52. doi:10.1007/BF01078179.
[8] Dwyer, Paul S. (1967). "Some Applications of Matrix Derivatives in Multivariate Analysis". JASA 62: 607–625. Available at JSTOR: http://www.jstor.org/pss/2283988
[9] Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Article Sources and Contributors

246

Article Sources and Contributors


Probability distribution Source: http://en.wikipedia.org/w/index.php?oldid=510788258 Contributors: (:Julien:), 198.144.199.xxx, 3mta3, A.M.R., A3141592653589, A5, Abhinav316, AbsolutDan, Adrokin, Alansohn, Alexius08, Amircrypto, Anonymous Dissident, Ap, Applepiein, Avenue, AxelBoldt, BD2412, Baccyak4H, Benwing, Bfigura's puppy, Bhoola Pakistani, Bkkbrad, Branny 96, Bryan Derksen, Btyner, Calvin 1998, Caramdir, Cburnett, Chirlu, Chris the speller, ChrisGualtieri, Classical geographer, Closedmouth, Conversion script, Courcelles, Csigabi, Damian Yerrick, Davhorn, David Eppstein, David Vose, DavidCBryant, Dcljr, Delldot, Den fjttrade ankan, Dick Beldin, Digisus, Dino, Domminico, Dysprosia, Eliezg, Emijrp, Epbr123, Eric Kvaalen, Fintor, Firelog, Fnielsen, Frietjes, G716, Gaius Cornelius, Gala.martin, Gandalf61, Gate2quality, Giftlite, Gjnaasaa, GoodDamon, Graham87, Hamamelis, Hu12, Hughperkins, I dream of horses, ImperfectlyInformed, It Is Me Here, Iwaterpolo, J.delanoy, JJ Harrison, JRSpriggs, Jan eissfeldt, JayJasper, Jclemens, Jipumarino, Jitse Niesen, Jncraton, Johndburger, Jojalozzo, Jon Awbrey, Josuechan, Jsd115, Jsnx, Jtkiefer, Kastchei, Knutux, Larryisgood, LiDaobing, Lilac Soul, Lollerskates, Lotje, Loupeter, MGriebe, Magioladitis, Marie Poise, MarkSweep, Markhebner, Marner, Megaloxantha, Melcombe, Mental Blank, Michael Hardy, Miguel, MisterSheik, Morton.lin, MrOllie, Napzilla, Nbarth, NuclearWarfare, O18, OdedSchramm, Ojigiri, OlEnglish, OverInsured, Oxymoron83, PAR, Pabristow, Patrick, Paul August, Pax:Vobiscum, Pgan002, Phys, Policron, Ponnu, Poor Yorick, Populus, Ptrf, Quietbritishjim, Qwfp, R'n'B, Riceplaytexas, Rich Farmbrough, Richard D. LeCour, Rinconsoleao, Roger.simmons, Rumping, Rursus, Rwalker, Salgueiro, Salix alba, Samois98, Sandym, Schmock, Seglea, Serguei S. 
Dukachev, ServiceAT, ShaunES, Shizhao, Silly rabbit, SiobhanHansa, Sky Attacker, Statlearn, Stpasha, TNARasslin, TakuyaMurata, Tarotcards, Tayste, Techman224, Tedtoal, Teply, TexasDawg, Thamelry, The Anome, The Thing That Should Not Be, TheCoffee, Tillander, Tomi, Topology Expert, Tordek ar, Tsirel, Ttony21, Unyoyega, Uvainio, Velella, VictorAnyakin, WestwoodMatt, Whosasking, Whosyourjudas, X-Bert, Xuhuyang, Zundark, 269 anonymous edits Bernoulli distribution Source: http://en.wikipedia.org/w/index.php?oldid=512660101 Contributors: Adriaan Joubert, Albmont, AlekseyP, Alex.j.flint, Amonet, Andreas27krause, Aquae, Aziz1005, Bando26, Bgeelhoed, Bryan Derksen, Btyner, Camkego, Cburnett, Charles Matthews, Complex01, Discospinster, El C, Eric Kvaalen, FilipeS, Flatland1, Giftlite, Herix, ILikeHowMuch, Iwaterpolo, Jitse Niesen, Jpk, Jt, Kyng, Lilac Soul, Lothar von Richthofen, MarkSweep, Melcombe, Michael Hardy, Miguel, MrOllie, Olivier, Ozob, PAR, Pabristow, Policron, Poor Yorick, Qwfp, RDBury, Rdsmith4, Schmock, Sharmistha1, TakuyaMurata, Theyshallbow, Tomash, Tomi, Typofier, Urhixidur, User3000, Weialawaga, Whkoh, Wikid77, Wjastle, Wtanaka, Zven, 51 anonymous edits Binomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=508848826 Contributors: -- April, Aarond10, AchatesAVC, AdamRetchless, Ahoerstemeier, Ajs072, AlanUS, [email protected], Alexius08, Alzarian16, Anupam, Atemperman, Atlant, AxelBoldt, Ayla, BPets, Baccyak4H, BenFrantzDale, Benwing, Bill Malloy, Blue520, Br43402, Brutha, Bryan Derksen, Btyner, Can't sleep, clown will eat me, Cburnett, Cdang, Cflm001, Charles Matthews, Chewings72, Conversion script, Coppertwig, Crackerbelly, Cuttlefishy, DaDexter, David Martland, DavidFHoughton, Daytona2, Deville, Dick Beldin, DrMicro, Dricherby, Duoduoduo, Eesnyder, Elipongo, Eric Kvaalen, Falk Lieder, Fisherjs, Froid, G716, Garde, Gary King, Gauravm1312, Gauss, Gerald Tros, Giftlite, Gogobera, GorillaWarfare, Gperjim, Graham87, Hede2000, Henrygb, Hirak 99, Ian.Shannon, Ilmari Karonen, Intelligentsium, Iwaterpolo, J04n, JB82, JEH, JamesBWatson, Janlo, Johnstjohn, Kakofonous, Kmassey, Knutux, Koczy, LOL, Larry_Sanger, LiDaobing, Linas, Lipedia, Ljwsummer, Logan, Lucaswilkins, MC-CPO, MER-C, ML5, MSGJ, Madkaugh, Mark Arsten, MarkSweep, Marvinrulesmars, Materialscientist, Mboverload, McKay, Meisterkoch, Melcombe, Mhadi.afrasiabi, Michael Hardy, MichaelGensheimer, Miguel, MisterSheik, Mmustafa, Moseschinyama, Mr Ape, MrOllie, Musiphil, N5iln, N6ne, Nasnema, NatusRoma, Nbarth, Neshatian, New Thought, Nguyenngaviet, Nschuma, Oleg Alexandrov, PAR, Pallaviagarwal90, Paul August, Ph.eyes, PhotoBox, Phr, Pleasantville, Postrach, PsyberS, Pt, Pufferfish101, Qonnec, Quietbritishjim, Qwertyus, Qwfp, R'n'B, Redtryfan77, Rgclegg, Rich Farmbrough, Rjmorris, Rlendog, Robinh, Ruber chiken, Sealed123, Seglea, Sigma0 1, Sintau.tayua, Smachet, SoSaysChappy, Spellcast, Stebulus, Steven J. 
Anderson, Stigin, Stpasha, Supergroupiejoy, TakuyaMurata, Talgalili, Tayste, Tedtoal, The Thing That Should Not Be, Tim1357, Timwi, Toll booth, Tomi, Tyw7, VectorPosse, Vincent Semeria, Welhaven, Westm, Wikid77, WillKitch, Wjastle, Xiao Fei, Ylloh, Youandme, ZantTrang, Zmoboros, 385 anonymous edits Uniform distribution (discrete) Source: http://en.wikipedia.org/w/index.php?oldid=513279665 Contributors: Alansohn, Alstublieft, Bob.warfield, Btyner, Colin Douglas Howell, DVdm, DaBler, Dec1707, DixonD, Duoduoduo, Fangz, Fasten, FilipeS, Furby100, Giftlite, Gvstorm, Hatster301, Henrygb, Iwaterpolo, Jamelan, Klausness, LimoWreck, Melcombe, Michael Hardy, Mike74dk, Nbarth, O18, P64, PAR, Paul August, Postrach, Qwfp, Random2001, Stannered, Taylorluker, The Wordsmith, User A1, 59 anonymous edits Poisson distribution Source: http://en.wikipedia.org/w/index.php?oldid=512914339 Contributors: 2620:C6:8000:300:E84F:1D25:AA79:1023, A3141592653589, Abtweed98, Adair2324, AdjustShift, Adoniscik, Aeusoes1, AhmedHan, Ahoerstemeier, AlanUS, Alexius08, Alfpooh, Anchoar2001, Andre Engels, Ankit.shende, Anomalocaris, Army1987, Atonix, AxelBoldt, BL, Baccyak4H, Bdmy, Beetstra, BenFrantzDale, Bender235, Bgeelhoed, Bidabadi, Bikasuishin, Bjcairns, Bobblewik, Brendan642, Bryan Derksen, Btyner, CameronHarris, Camitz, Captain-n00dle, Caricologist, Cburnett, ChevyC, Chinasaur, Chriscf, Cimon Avaro, Ciphergoth, Citrus Lover, Climbert8, Constructive editor, Coppertwig, Count ludwig, Cqqbme, Cubic Hour, Cwkmail, DARTH SIDIOUS 2, Damistmu, Danger, Dannomatic, DannyAsher, David Haslam, Deacon of Pndapetzim, Debastein, Dejvid, Denis.arnaud, Derek farn, Dhollm, Dougsim, DrMicro, Dreadstar, Drevicko, Duke Ganote, Eduardo Antico, Edward, Emilpohl, Enric Naval, EnumaElish, Everyking, Falk Lieder, Favonian, Fayenatic london, Fnielsen, Fresheneesz, Frobnitzem, Fuzzyrandom, Gaius Cornelius, Gcbernier, Giftlite, Giganut, Gigemag76, Giraffedata, Godix, Gperjim, GregorB, HamburgerRadio, Headbomb, Heliac, Henrygb, Hgkamath, Hu, HyDeckar, Hypnotoad33, Iae, Ian.Shannon, Ilmari Karonen, Inquisitus, Intervallic, Iridescent, Iwaterpolo, Jamesscottbrown, JavOs, Jeff G., Jfr26, Jitse Niesen, Jleedev, Jmichael ll, Joeltenenbaum, Joseph.m.ernst, Jpk, Jrennie, Jshen6, KSmrq, Kastchei, Kay Kiljae Lee, Kbk, King of Hearts, Kjfahmipedia, Kmtmeth, LOL, Laussy, Lgallindo, Lilac Soul, Linas, Ling Kah Jai, Lklundin, Logan, Loom91, Lucaswilkins, MC-CPO, Magicxcian, Marie Poise, MarkSweep, Mathstat, Maxis ftw, McKay, Mdebets, Melcombe, Michael Hardy, Michael Ross, Miguel, Mike Young, Mindmatrix, Minesweeper, MisterSheik, Mobius, MrOllie, Mufka, Mungbean, Munksm, NAHID, NabeelNM42, Nasnema, Nealmcb, Ned Scott, Netrapt, Nevsan, New Thought, Nicooo, Nijdam, Nipoez, Njerseyguy, Nsaa, O18, Ohnoitsjamie, Orionus, Ott2, PAR, PBH, Pabristow, Pb30, Pftupper, Pfunk42, Philaulait, Phreed, PierreAbbat, Piplicus, Plasmidmap, Pmokeefe, Postrach, Princebustr, Qacek, Quietbritishjim, Qwfp, Qxz, Robert Hiller, Robinh, Ryguasu, SPART82, Saimhe, Salgueiro, Sanmele, Saxenad, Schaber, Scientist xkcd fan, Sdedeo, Sean r lynch, SeanJA, Seaphoto, SebastianHelm, Selket, Sergey Suslov, Sergio01, Serrano24, Sheldrake, SiriusB, Skbkekas, Skittleys, Slarson, Snorgy, Spoon!, Stangaa, Steve8675309, Storm Rider, Stpasha, Strait, Sun Creator, Suslindisambiguator, Svick, Syzygy, TakuyaMurata, Talgalili, Taw, Taylanwiki, Tayste, Tbhotch, Tcaruso2, TeaDrinker, Tedtoal, Teply, The Anome, The Thing That Should Not Be, TheNoise, 
TheTaintedOne, Theda, Thorvaldurgunn, Tomi, Tommyjs, Tpb, Tpbradbury, Tpruane, Uncle Dick, Uriyan, User A1, Vector Alfawaz, VodkaJazz, Vrkaul, Wavelength, Weedier Mickey, Whosanehusy, Wikibuki, Wikid77, Wileycount, WillowW, Wjastle, Wtanaka, XJamRastafire, YogeshwerSharma, Youandme, ZioX, Zundark, jlfr, 449 anonymous edits Beta-binomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=506664672 Contributors: Auntof6, Baccyak4H, Benwing, Charlesmartin14, Chris the speller, Domminico, Frederic Y Bois, Giftlite, Gnp, GoingBatty, Kdisarno, Massbless, Melcombe, Michael Hardy, Myasuda, Nschuma, PigFlu Oink, Qwfp, Rjwilmsi, Sheppa28, Thouis.r.jones, Thtanner, Tomixdf, Willy.feng, 34 anonymous edits Negative binomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=511878853 Contributors: 2001:468:C80:4390:AD57:41B5:CD6D:482F, A3141592653589, Airplaneman, Alexius08, Arunsingh16, Ascnder, Asymmetric, AxelBoldt, Benwing, Bo Jacoby, Bryan Derksen, Btyner, Burn, CALR, CRGreathouse, Cburnett, Charles Matthews, Chocrates, Colinpmillar, Cretog8, Damian Yerrick, Dcljr, Deathbyfungwahbus, DutchCanadian, DwightKingsbury, Econstatgeek, Eggstone, Entropeter, Evra83, Facorread, Felipehsantos, Formivore, Gabbe, Gauss, Giftlite, Headbomb, Henrygb, Hilgerdenaar, Iowawindow, Iwaterpolo, Jahredtobin, Jason Davies, Jfr26, Jmc200, Keltus, Kevinhsun, Kodiologist, Linas, Ludovic89, MC-CPO, Manoguru, MarkSweep, Mathstat, McKay, Melcombe, Michael Hardy, Mindmatrix, Moldi, Nov ialiste, Numcrun, O18, Odysseuscalypso, Oxymoron83, Panicpgh, Phantomofthesea, Pmokeefe, Qwfp, Rar, Renatovitolo, Rje, Rumping, Salgueiro, Sapphic, Schmock, Shreevatsa, Sleempaster21229, Sleepmaster21229, Statone, Steve8675309, Stpasha, Sumsum2010, TGS, Talgalili, Taraborn, Tedtoal, TomYHChan, Tomi, Trevor.maynard, User A1, Waltpohl, Wikid77, Wile E. 
Heresiarch, Wjastle, Zvika, 152 anonymous edits Geometric distribution Source: http://en.wikipedia.org/w/index.php?oldid=505976779 Contributors: AdamSmithee, AlanUS, Alexf, Amonet, Apocralyptic, Ashkax, Bjcairns, Bo Jacoby, Bryan Derksen, Btyner, Calbaer, Capricorn42, Cburnett, Classicalecon, Count ludwig, Damian Yerrick, Deineka, Digfarenough, El C, Eraserhead1, Felipehsantos, Frietjes, Gauss, Giftlite, Gogobera, Gsimard, Gkhan, Hhassanein, Hilgerdenaar, Imranaf, Iwaterpolo, Juergik, K.F., LOL, MarkSweep, MathKnight, Mav, Mcld, Melcombe, Michael Hardy, MichaelRutter, Mike Rosoft, Mikez, Mr.gondolier, NeonMerlin, Nov ialiste, PhotoBox, Qwfp, Ricklethickets, Robma, Rumping, Ryguasu, Serdagger, Skbkekas, Speed8224, Squizzz, Steve8675309, Sun Creator, SyedAshrafulla, TakuyaMurata, Terrek, Tomi, VoltzJer, Vrenator, Wafulz, Wikid77, Wjastle, Wrogiest, Wtruttschel, Xanthoxyl, Youandme, 119 anonymous edits Multinomial distribution Source: http://en.wikipedia.org/w/index.php?oldid=505446379 Contributors: A5, Adndns, Albmont, Baccyak4H, Benwing, Btyner, CaAl, Camrn86, Charles Matthews, ChevyC, Dysprosia, Entropeneur, Escapepea, Giftlite, Gjnaasaa, Guillaume Filion, Icairns, Iwaterpolo, J04n, Jamelan, Jamie King, Jeffnhood, Karlpearson, Killerandy, Linas, ManuLassila, McKay, Mebden, Melcombe, Michael Hardy, MisterSheik, Myasuda, Nbarth, Ninjagecko, O18, Qwfp, Robinh, Schmock, Semifinalist, Sohale, Squidonius, Stephan sand, Steve8675309, Tomi, Tomixdf, Welhaven, Wjastle, Wolfman, Zvika, 44 anonymous edits Categorical distribution Source: http://en.wikipedia.org/w/index.php?oldid=492025983 Contributors: Albmont, Awaterl, Benwing, DannyAsher, Disconcision, Giftlite, Headlessplatter, Holopoj, Joakimekstrom, Laughsinthestocks, Marie Poise, Mcld, Melcombe, MisterSheik, Nbarth, Pasmargo, Qwfp, Rbraunwa, Rhaertel80, Ringger, Rumping, Sderose, Wjastle, 23 anonymous edits Dirichlet distribution Source: http://en.wikipedia.org/w/index.php?oldid=512139246 Contributors: 2607:F140:400:1036:FA1E:DFFF:FEE6:501A, A5, Adfernandes, Amit Moscovich, Azeari, Azhag, BSVulturis, Barak, Ben Ben, Bender2k14, Benwing, BrotherE, Btyner, Charles Matthews, ChrisGualtieri, Coffee2theorems, Crasshopper, Cretog8, Daf, Dlwhall, Drevicko, Dycotiles, Erikerhardt, Finnancier, Franktuyl, Frigyik, Giftlite, Gxhrid, Herve1729, Ipeirotis, Ivan.Savov, J04n, Josang, Kzollman, Liuyipei, M0nkey, Maiermarco, Mandarax, MarkSweep, Mathknightapprentice, Mathstat, Mavam, Mcld, Melcombe, Michael Hardy, MisterSheik, Mitch3, Myasuda, Nbarth, Oscar Tckstrm, Prasenjitmukherjee, Qwfp, Robinh, Rvencio, Salgueiro, Saturdayswiki, Schmock, Shaharsh, Sinuhet, Slxu.public, Tomi, Tomixdf, Tuonawa, Wavelength, Whosasking, Wjastle, Wolfman, Zvika, 76 anonymous edits

Article Sources and Contributors


Uniform distribution (continuous) Source: http://en.wikipedia.org/w/index.php?oldid=509539421 Contributors: A.M.R., Abdullah Chougle, Aegis Maelstrom, Albmont, AlekseyP, Algebraist, Amatulic, ArnoldReinhold, B k, Baccyak4H, Benlisquare, Brianga, Brumski, Btyner, Capricorn42, Cburnett, Ceancata, DaBler, DixonD, DrMicro, Duoduoduo, Euchiasmus, Fasten, FilipeS, Gala.martin, Gareth Owen, Giftlite, Gilliam, Gritzko, Henrygb, Iae, Iwaterpolo, Jamelan, Jitse Niesen, Marie Poise, Melcombe, Michael Hardy, MisterSheik, Nbarth, Nsaa, Oleg Alexandrov, Ossska, PAR, Qwfp, Ray Chason, Robbyjo, Ruy Pugliesi, Ryan Vesey, Sandrarossi, Sl, Stpasha, Stwalkerster, Sun Creator, Tpb, User A1, Vilietha, Warriorman21, Wikomidia, Zundark, 97 anonymous edits Beta distribution Source: http://en.wikipedia.org/w/index.php?oldid=513408420 Contributors: Adamace123, AllenDowney, AnRtist, Arauzo, Art2SpiderXL, Awaterl, Baccyak4H, Benwing, Betadistribution, BlaiseFEgan, Bootstoots, Bryan Derksen, Btyner, Cburnett, Crasshopper, Cronholm144, DFRussia, Dean P Foster, Dr. J. Rodal, DrMicro, Dshutin, Eric Kvaalen, FilipeS, Fintor, Fnielsen, Giftlite, Gill110951, GregorB, Gruntfuterk, Gkhan, HappyCamper, Henrygb, Hilgerdenaar, Hypnotoad33, IapetusWave, ImperfectlyInformed, J04n, Jamessungjin.kim, Janlo, Jhapk, Jheald, Joriki, Josang, Ketiltrout, Krishnavedala, Kts, Ladislav Mecir, Linas, LiranKatzir, Livius3, Lovibond, MarkSweep, Mcld, Melcombe, Michael Hardy, MisterSheik, Mochan Shrestha, Mogism, Mpaa, MrOllie, Nbarth, O18, Oberobic, Ohanian, Oleg Alexandrov, Ott2, PAR, PBH, Paulginz, Pleasantville, Pnrj, Qwfp, Rcsprinter123, Rjwilmsi, Robbyjo, Robinh, Robma, Rodrigo braz, Rumping, SJP, ST47, Saric, Schmock, SharkD, Sheppa28, Steve8675309, Stoni, Sukisuki, Thumperward, Tomi, Tthrall, UndercoverAgents, Urhixidur, Wile E. Heresiarch, Wjastle, YearOfGlad, 164 anonymous edits Normal distribution Source: http://en.wikipedia.org/w/index.php?oldid=512226312 Contributors: 119, 194.203.111.xxx, 213.253.39.xxx, 5:40, A. Parrot, A. 
Pichler, A.M.R., AaronSw, Abecedare, Abtweed98, Alektzin, Alex.j.flint, Ali Obeid, AllanBz, Alpharigel, Amanjain, Andraaide, AndrewHowse, Anna Lincoln, Appoose, Art LaPella, Asitgoes, Aude, Aurimus, Awickert, AxelBoldt, Aydee, Aylex, Baccyak4H, Beetstra, BenFrantzDale, Benwing, Bhockey10, Bidabadi, Bluemaster, Bo Jacoby, Boreas231, Boxplot, Br43402, Brock, Bryan Derksen, Bsilverthorn, Btyner, Bubba73, Burn, CBM, CRGreathouse, Calvin 1998, Can't sleep, clown will eat me, CapitalR, Cburnett, Cenarium, Cgibbard, Charles Matthews, Charles Wolf, Cherkash, Cheungpuiho04, Chill doubt, Chris53516, ChrisHodgesUK, Christopher Parham, Ciphergoth, Cmglee, Coffee2theorems, ComputerPsych, Conversion script, Coolhandscot, Coppertwig, Coubure, Courcelles, Crescentnebula, Cruise, Cwkmail, Cybercobra, DEMcAdams, DFRussia, DVdm, Damian Yerrick, DanSoper, Dannya222, Darwinek, David Haslam, David.hilton.p, DavidCBryant, Davidiad, Den fjttrade ankan, Denis.arnaud, Derekleungtszhei, Dima373, Dj thegreat, Doood1, DrMicro, Drilnoth, Drostie, Dudzcom, Duoduoduo, Dzordzm, EOBarnett, Eclecticos, Ed Poor, Edin1, Edokter, EelkeSpaak, Egorre, Elektron, Elockid, Enochlau, Epbr123, Eric Kvaalen, Ericd, Evan Manning, Fang Aili, Fangz, Fergusq, Fgnievinski, Fibonacci, FilipeS, Fintor, Firelog, Fjdulles, Fledylids, Fnielsen, Fresheneesz, G716, GB fan, Galastril, Gandrusz, Gary King, Gauravm1312, Gauss, Geekinajeep, Gex999, GibboEFC, Giftlite, Gigemag76, Gil Gamesh, Gioto, GordontheGorgon, Gperjim, Graft, Graham87, Gunnar Larsson, Gzornenplatz, Gkhan, Habbie, Headbomb, Heimstern, Henrygb, HereToHelp, Heron, Hiihammuk, Hiiiiiiiiiiiiiiiiiiiii, Hu12, Hughperkins, Hugo gasca aragon, I dream of horses, Ian Pitchford, IdealOmniscience, Ikanreed, It Is Me Here, Itsapatel, Ivan tambuk, Iwaterpolo, J heisenberg, JA(000)Davidson, JBancroftBrown, JaGa, Jackzhp, Jacobolus, JahJah, JanSuchy, Jason.yosinski, Javazen, Jeff560, Jeffjnet, Jgonion, Jia.meng, Jim.belk, Jitse Niesen, Jmlk17, Joebeone, Jonkerz, Jorgenumata, Joris Gillis, Jorisverbiest, Josephus78, Josuechan, Jpk, Jpsauro, Junkinbomb, KMcD, KP-Adhikari, Karl-Henner, Kaslanidi, Kastchei, Kay Dekker, Keilana, KipKnight, Kjtobo, Knutux, LOL, Lansey, Laurifer, Lee Daniel Crocker, Leon7, Lilac Soul, Livius3, Lixy, Loadmaster, Lpele, Lscharen, Lself, MATThematical, MIT Trekkie, ML5, Manticore, MarkSweep, Markhebner, Markus Krtzsch, Marlasdad, Mateoee, Mathstat, Mcorazao, Mdebets, Mebden, Meelar, Melcombe, Message From Xenu, Michael Hardy, Michael Zimmermann, Miguel, Mikael Hggstrm, Mikewax, Millerdl, Mindmatrix, MisterSheik, Mkch, Mm 202, Morqueozwald, Mr Minchin, Mr. 
okinawa, MrKIA11, MrOllie, MrZeebo, Mrocklin, Mundhenk, Mwtoews, Mysteronald, Naddy, Nbarth, Netheril96, Nicholasink, Nicolas1981, Nilmerg, NoahDawg, Noe, Nolanbard, NuclearWarfare, O18, Ohnoitsjamie, Ojigiri, Oleg Alexandrov, Oliphaunt, Olivier, Orderud, Ossiemanners, Owenozier, P.jansson, PAR, PGScooter, Pablomme, Pabristow, Paclopes, Patrick, Paul August, Paulpeeling, Pcody, Pdumon, Personman, Petri Krohn, Pfeldman, Pgan002, Pinethicket, Piotrus, Plantsurfer, Plastikspork, Policron, Polyester, Prodego, Prumpf, Ptrf, Qonnec, Quietbritishjim, Qwfp, R.J.Oosterbaan, R3m0t, RDBury, RHaworth, RSStockdale, Rabarberski, Rajah, Rajasekaran Deepak, Randomblue, Rbrwr, Renatokeshet, RexNL, Rich Farmbrough, Richwales, Rishi.bedi, Rjwilmsi, Robbyjo, Robma, Romanski, Ronz, Rubicon, RxS, Ryguasu, SGBailey, SJP, Saintrain, SamuelTheGhost, Samwb123, Sander123, Schbam, Schmock, Schwnj, Scohoust, Seaphoto, Seidenstud, Seliopou, Seraphim, Sergey Suslov, SergioBruno66, Shabbychef, Shaww, Shuhao, Siddiganas, Sirex98, Smidas3, Snoyes, Sole Soul, Somebody9973, Stan Lioubomoudrov, Stephenb, Stevan White, Stpasha, StradivariusTV, Sullivan.t.j, Sun Creator, SusanLarson, Sverdrup, Svick, Talgalili, Taxman, Tdunning, TeaDrinker, The Anome, The Tetrast, TheSeven, Thekilluminati, TimBentley, Tomeasy, Tomi, Tommy2010, Tony1, Trewin, Tristanreid, Trollderella, Troutinthemilk, Tryggvi bt, Tschwertner, Tstrobaugh, Unyoyega, Vakulgupta, Velocidex, Vhlafuente, Vijayarya, Vinodmp, Vrkaul, Waagh, Wakamex, Wavelength, Why Not A Duck, Wikidilworth, Wile E. Heresiarch, Wilke, Will Thimbleby, Willking1979, Winsteps, Wissons, Wjastle, Wwoods, XJamRastafire, Yoshigev, Yoshis88, Zaracattle, Zbxgscqf, Zero0000, Zhurov, Zrenneh, Zundark, Zvika, , 749 anonymous edits Student's t-distribution Source: http://en.wikipedia.org/w/index.php?oldid=513292695 Contributors: 3mta3, A bit iffy, A. 
di M., A.M.R., Addone, Aetheling, Afluent Rider, Albmont, AlexAlex, Alvin-cs, Amonet, Arsenikk, Arthur Rubin, Asperal, Avraham, AxelBoldt, B k, Beetstra, Benwing, Bless sins, Bobo192, BradBeattie, Bryan Derksen, Btyner, CBM, Cburnett, Chiqago, Chris53516, Chriscf, Classical geographer, Clbustos, Coppertwig, Count Iblis, Crouchy7, Daige, DanSoper, Danko Georgiev, Daveswahl, Dchristle, Ddxc, Dejo, Dkf11, Dmcalist, Dmcg026, Duncharris, EPadmirateur, EdJohnston, Eleassar, Eric Kvaalen, Ethan, Everettr2, F.morett, Fgimenez, Finnancier, Fnielsen, Frankmeulenaar, [email protected], Furrykef, G716, Gabrielhanzon, Giftlite, Gperjim, Guibber, Hadleywickham, Hankwang, Hemanshu, Hirak 99, History Sleuth, Huji, Icairns, Ichbin-dcw, Ichoran, Ilhanli, Iwaterpolo, JMiall, JamesBWatson, Jitse Niesen, Jmk, John Baez, Johnson Lau, Jost Riedel, Kastchei, Kiefer.Wolfowitz, Koavf, Kotar, Kroffe, Kummi, Kyosuke Aoki, Lifeartist, Linas, Lvzon, M.S.K., MATThematical, Madcoverboy, Maelgwn, Mandarax, MarkSweep, Mcarling, Mdebets, Melcombe, Michael C Price, Michael Hardy, Mig267, Millerdl, MisterSheik, MrOllie, Muzzamo, Nbarth, Netheril96, Ngwt, Nick Number, NuclearWarfare, O18, Ocorrigan, Oliphaunt, PAR, PBH, Pegasus1457, Petter Strandmark, Phb, Piotrus, Pmanderson, Quietbritishjim, Qwfp, R'n'B, R.e.b., Rich Farmbrough, Rjwilmsi, Rlendog, Robert Ham, Robinh, Royote, Salgueiro, Sam Derbyshire, Sander123, Scientific29, Secretlondon, Seglea, Serdagger, Sgb 85, Shaww, Shoefly, Skbkekas, Sonett72, Sougandh, Sprocketoreo, Srbislav Nesic, Stasyuha, Steve8675309, Stpasha, Strait, TJ0513, Techman224, Tgambon, The Anome, Theodork, Thermochap, ThorinMuglindir, Tjfarrar, Tolstoy the Little Black Cat, Tom.Reding, TomCerul, Tomi, Tutor dave, Uncle G, Unknown, User A1, Valravn, Velocidex, Waldo000000, Wastle, Wavelength, Wikid77, Wile E. Heresiarch, Xenonice, ZantTrang, 295 anonymous edits Gamma distribution Source: http://en.wikipedia.org/w/index.php?oldid=513264390 Contributors: 2001:6B0:1:12B0:7589:EAB6:F044:6F88, 2001:6B0:1:12B0:B11C:6521:2ABD:61AC, A5, Aastrup, Abtweed98, Adam Clark, Adfernandes, Albmont, Amonet, Aple123, Apocralyptic, Arg, Asteadman, Autopilot, Baccyak4H, Barak, Bdmy, Benwing, Berland, Bethb88, Bo Jacoby, Bobmath, Bobo192, Brenton, Bryan Derksen, Btyner, CanadianLinuxUser, CapitalR, Cburnett, Cerberus0, ClaudeLo, Cmghim925, Complex01, Darin, David Haslam, Dicklyon, Dlituiev, Dobromila, Donmegapoppadoc, DrMicro, Dshutin, Entropeneur, Entropeter, Erik144, Eug, Fangz, Fnielsen, Frau K, Frobnitzem, Gaius Cornelius, Gandalf61, Gauss, Giftlite, Gjnaasaa, Henrygb, Hgkamath, Iwaterpolo, Jason Goldstick, Jirka6, Jlc46, JonathanWilliford, Jshadias, Kastchei, Langmore, Linas, Lovibond, LoyalSoldier, LukeSurl, Luqmanskye, MarkSweep, Mathstat, Mcld, Mebden, Melcombe, Mich8611, Michael Hardy, MisterSheik, Mpaa, MrOllie, MuesLee, Mundhenk, Narc813, Nickfeng88, O18, PAR, PBH, Patrke, Paul Pogonyshev, Paulginz, Phil Boswell, Pichote, Policron, Popnose, Qiuxing, Quietbritishjim, Qwfp, Qzxpqbp, RSchlicht, Robbyjo, Robinh, Rockykumar1982, Samsara, Sandrobt, Schmock, Smmurphy, Stephreg, Stevvers, Sun Creator, Supergrane, Talgalili, Tayste, TestUser001, Thomas stieltjes, Thric3, Tom.Reding, Tomi, Tommyjs, True rover, Umpi77, User A1, Vminin, Wavelength, Wiki me, Wiki5d, Wikid77, Wile E. Heresiarch, Wjastle, Xuehuit, Zvika, 317 anonymous edits Pareto distribution Source: http://en.wikipedia.org/w/index.php?oldid=510200188 Contributors: A. 
B., Alexxandros, Am rods, Antandrus, Asitgoes, Avraham, AxelBoldt, Beland, BenKovitz, Benwing, Boxplot, Bryan Derksen, Btyner, Bubbleboys, Buenas das, Carrionluggage, Chadernook, ChemGardener, Chuanren, Clark Kent, Cogiati, Courcelles, Cyberyder, DaveApter, David Haslam, Doobliebop, DrMicro, Dreftymac, EdChem, Edward, Enigmaman, Eric Kvaalen, Fenice, Fentlehan, Financestudent, Fpoursafaei, Giftlite, Glenn, Gruzd, Headbomb, Henrygb, Heron, Hgkamath, Hve, Ida Shaw, Isheden, Iwaterpolo, J sandstrom, JavOs, Jive Dadson, Joseph Solis in Australia, Joxemai, Lendu, LunaDeFerrari, Mack2, Mange01, MarkSweep, Marsianus, Mathstat, Mcld, Melchoir, Melcombe, Mgil83, Michael Hardy, Mindmatrix, MisterSheik, Msghani, Nbarth, Noe, Nowa, O18, Olaf, P3^1$Problems, PAR, PBH, Paintitblack ft, Pankaj303er, Paul Pogonyshev, Phil Boswell, Philip Trueman, PhysPhD, Plutologist, Probabilityislogic, Purple Post-its, Qwfp, R.J.Oosterbaan, Reedy, Rjwilmsi, Rlendog, Rock soup, Rror, SanderSpek, Scalimani, Sergey Suslov, Shell Kinney, Shomoita, Srich32977, Stpasha, Tatrgel, The Anome, Tomi, UKoch, Undsoweiter, User A1, Vivacissamamente, Vyznev Xnebara, Wjastle, Wprestong, 154 anonymous edits Inverse-gamma distribution Source: http://en.wikipedia.org/w/index.php?oldid=511619843 Contributors: Benwing, Biostatprof, Btyner, Cburnett, Cquike, Damiano.varagnolo, Dstivers, Fnielsen, Giftlite, Greenw2, Iwaterpolo, Josevellezcaldas, Kastchei, M27315, MarkSweep, Melcombe, MisterSheik, PAR, Qwfp, [email protected], Rlendog, Rphlypo, Shadowjams, Sheppa28, Slavatrudu, Tomi, User A1, Wjastle, 44 anonymous edits Chi-squared distribution Source: http://en.wikipedia.org/w/index.php?oldid=513266111 Contributors: A.R., AaronSw, AdamSmithee, Afa86, Alvin-cs, Analytics447, Animeronin, Ap, AstroWiki, AxelBoldt, BenFrantzDale, BiT, Blaisorblade, Bluemaster, Bryan Derksen, Btyner, CBM, Cburnett, Chaldor, Chris53516, Constructive editor, Control.valve, DanSoper, Dbachmann, Dbenbenn, Den fjttrade ankan, Dgw, Digisus, DrMicro, Drhowey, EOBarnett, Eliel Jimenez, Eliezg, Emilpohl, Etoombs, Ettrig, Fergikush, Fibonacci, Fieldday-sunday, FilipeS, Fintor, G716, Gaara144, Gauss, Giftlite, Gperjim, Henrygb, Herbee, Hgamboa, HyDeckar, Iav, Icseaturtles, Isopropyl, It Is Me Here, Iwaterpolo, J-stan, Jackzhp, Jaekrystyn, Jason Goldstick, Jdgilbey, Jitse Niesen, Johnlemartirao, Johnlv12, Jspacemen01-wiki, Kastchei, Knetlalala, KnightRider, Kotasik, LeilaniLad, Leotolstoy, LilHelpa, Lixiaoxu, Loodog, Loren.wilton, Lovibond, MATThematical, MER-C, MarkSweep, Markg0803, Master of Puppets, Materialscientist, Mcorazao, Mdebets, Melcombe, Mgiganteus1, Michael Hardy, Microball, Mikael Hggstrm, Mindmatrix, MisterSheik, MrOllie, MtBell, Nbarth, Neon white, Nm420, NocturneNoir, Notatoad, O18, Oleg Alexandrov, PAR, Pabristow, Pahan, Paul August, Paulginz, Pet3ris, Philten, Policron, Pstevens, Qiuxing, Quantling, Quietbritishjim, Qwfp, Rflrob, Rich Farmbrough, Rigadoun, Robinh, Ronz, Saippuakauppias, Sam Blacketer, SamuelTheGhost, Sander123, Schmock, Schwnj, Seglea, Shadowjams, Sheppa28, Shoefly, Sietse, Silly rabbit, Sligocki, Stephen C. 
Carlson, Steve8675309, Stpasha, Talgalili, Tarkashastri, The Anome, TheProject, TimBentley, Tom.Reding, Tombomp, Tomi, TomyDuby, Tony1, U+003F, User A1, Volkan.cevher, Wasell, Wassermann7, Weialawaga, Willem, Wjastle, Xnn, Zero0000, Zfr, Zvika, 295 anonymous edits F-distribution Source: http://en.wikipedia.org/w/index.php?oldid=484010808 Contributors: Adouzzy, Albmont, Amonet, Art2SpiderXL, Arthena, Bluemaster, Brenda Hmong, Jr, Bryan Derksen, Btyner, Califasuseso, Cburnett, DanSoper, DarrylNester, Dysprosia, Elmer Clark, Emilpohl, Ethaniel, Fnielsen, Ged.R, Giftlite, Gperjim, Hectorlamadrid, Henrygb, Jan eissfeldt, Jitse Niesen, JokeySmurf, Kastchei, Livingthingdan, MarkSweep, Markjoseph125, Materialscientist, Mdebets, Melcombe, Michael Hardy, MrOllie, Nehalem, O18, Oscar, PBH, Quietbritishjim, Qwfp,

Robinh, Salix alba, Seglea, Sheppa28, TedE, The Squicks, Timholy, Tom.Reding, Tomi, TomyDuby, UKoch, Unyoyega, Zorgkang, 50 anonymous edits Log-normal distribution Source: http://en.wikipedia.org/w/index.php?oldid=512048348 Contributors: 2D, A. Pichler, Acct4, Albmont, Alue, Ashkax, Asitgoes, Autopilot, AxelBoldt, Baccyak4H, BenB4, Berland, Bfinn, Biochem67, Bryan Derksen, Btyner, Cburnett, Christian Damgaard, Ciberelm, Ciemo, Cleared as filed, Cmglee, ColinGillespie, Constructive editor, Danhash, David.hilton.p, DonkeyKong64, DrMicro, Encyclops, Erel Segal, Evil Monkey, Floklk, Fluctuator, Frederic Y Bois, Fredrik, Gausseliminering, Giftlite, Humanengr, Hxu, IanOsgood, IhorLviv, Isheden, Iwaterpolo, Jackzhp, Jeff3000, Jetlee0618, Jimt075, Jitse Niesen, Khukri, Kiwi4boy, Lbwhu, Leav, Letsgoexploring, LilHelpa, Lojikl, Lunch, Magioladitis, Mange01, Martarius, Martinp23, Mcld, Melcombe, Michael Hardy, Mikael Hggstrm, Mishnadar, MisterSheik, Mivus, Nehalem, Nite1010, NonDucor, Ocatecir, Occawen, Osbornd, Oxymoron83, PAR, PBH, Paul Pogonyshev, Philip Trueman, Philtime, Phoxhat, Pichote, Pontus, Porejide, Qwfp, R.J.Oosterbaan, Raddick, Rgbcmy, Rhowell77, Ricardogpn, Rjwilmsi, Rlendog, Rmaus, RobertHannah89, Safdarmarwat, Sairvinexx, Schutz, Seriousme, Sheppa28, Skunkboy74, SqueakBox, Sterrys, Stigin, Stpasha, Ta bu shi da yu, Techman224, The Siktath, Till Riffert, Tkinias, Tomi, Umpi, Unyoyega, Urhixidur, User A1, Vincent Semeria, Wavelength, Weialawaga, Wikomidia, Wile E. Heresiarch, Wjastle, Zachlipton, ZeroOne, ^demon, 202 ,- anonymous edits Exponential distribution Source: http://en.wikipedia.org/w/index.php?oldid=513264885 Contributors: 2610:10:20:216:225:FF:FEF4:CAAF, A.M.R., A3r0, ActivExpression, Aiden Fisher, Amonet, Asitgoes, Avabait, Avraham, AxelBoldt, Bdmy, Beaumont, Benwing, Boriaj, Bryan Derksen, Btyner, Butchbrody, CD.Rutgers, CYD, Calmer Waters, CapitalR, Cazort, Cburnett, Closedmouth, Coffee2theorems, Cyp, Dcljr, Dcoetzee, Decrypt3, Den fjttrade ankan, Dudubur, Duoduoduo, Edward, Enchanter, Erzbischof, Fvw, Gauss, Giftlite, GorillaWarfare, Grinofadrunkwoman, Headbomb, Henrygb, Hsne, Hyoseok, IanOsgood, Igny, Ilmari Karonen, Isheden, Isis, Iwaterpolo, Jason Goldstick, Jester7777, Johndburger, Kan8eDie, Kappa, Karl-Henner, Kastchei, Kyng, LOL, MStraw, MarkSweep, Markjoseph125, Mattroberts, Mcld, Mdf, MekaD, Melcombe, Memming, Michael Hardy, Mindmatrix, MisterSheik, Monsterman222, Mpaa, Mwanner, Nothlit, Oysindi, PAR, Paul August, Policron, Qwfp, R'n'B, R.J.Oosterbaan, Remohammadi, Rich Farmbrough, Rp, Scortchi, Sergey Suslov, Shaile, Sheppa28, Shingkei, Skbkekas, Skittleys, Smack, Spartanfox86, Sss41, Stpasha, Taral, Taw, The Thing That Should Not Be, Thegeneralguy, TimBentley, Tomi, UKoch, Ularevalo98, User A1, Vsmith, WDavis1911, Wilke, Wjastle, Woohookitty, Wyatts, Yoyod, Z.E.R.O., Zeno of Elea, Zeycus, Zvika, Zzxterry, 210 anonymous edits Multivariate normal distribution Source: http://en.wikipedia.org/w/index.php?oldid=511661440 Contributors: A3 nm, Alanb, Arvinder.virk, AussieLegend, AxelBoldt, BenFrantzDale, Benwing, BernardH, BlueScreenD, Breno, Bryan Derksen, Btyner, Cburnett, Cfp, ChristophE, Chromaticity, Ciphergoth, Coffee2theorems, Colin Rowat, Dannybix, Delirium, Delldot, Derfugu, Distantcube, Eamon Nerbonne, Giftlite, Hongooi, HyDeckar, Isch de, J heisenberg, Jackzhp, Jasondet, Jondude11, Jorgenumata, Josuechan, KHamsun, Kaal, Karam.Anthony.K, KipKnight, KrodMandooon, 
KurtSchwitters, Lambiam, Lockeownzj00, Longbiaochen, MER-C, Marc.coram, MarkSweep, Mathstat, Mauryaan, MaxSem, Mcld, Mct mht, Mdf, Mebden, Meduz, Melcombe, Michael Hardy, Miguel, Mjdslob, Moonkey, Moriel, Mrwojo, Myasuda, Nabla, Ninjagecko, O18, Ogo, Oli Filth, Omrit, Opabinia regalis, Orderud, Paul August, Peni, PhysPhD, Picapica, Pycoucou, Quantling, Qwfp, R'n'B, Rely2004609, Riancon, Rich Farmbrough, Richardcherron, RickK, Rjwilmsi, Robinh, Rumping, Sanders muc, SebastianHelm, Selket, Set theorist, SgtThroat, Sigma0 1, SimonFunk, Sinuhet, Steve8675309, Stpasha, Strashny, Sun Creator, Tabletop, Talgalili, TedPavlic, Toddst1, Tommyjs, Ulner, Velocitay, Viraltux, Waldir, Wavelength, Wikomidia, Winterfors, Winterstein, Wjastle, Yoderj, Zelda, Zero0000, Zvika, , 213 anonymous edits Wishart distribution Source: http://en.wikipedia.org/w/index.php?oldid=510323706 Contributors: 3mta3, Aetheling, Aleenf1, Amonet, AtroX Worf, Baccyak4H, Benwing, Bryan Derksen, Btyner, Crusoe8181, David Eppstein, Deacon of Pndapetzim, Dean P Foster, Entropeneur, Erki der Loony, Gammalgubbe, Giftlite, Headbomb, Ixfd64, Joriki, Jrennie, Kastchei, Kiefer.Wolfowitz, Kurtitski, Lockeownzj00, MDSchneider, Melcombe, Michael Hardy, Mishnadar, P omega sigma, P.wirapati, Panosmarko, Perturbationist, PhysPhD, Qwfp, R'n'B, Robbyjo, Robinh, Ryker, Shae, Srbauer, TNeloms, Tom.Reding, Tomi, WhiteHatLurker, Wjastle, Zvika, 55 anonymous edits

Image Sources, Licenses and Contributors


File:Dice Distribution (bar).svg Source: http://en.wikipedia.org/w/index.php?title=File:Dice_Distribution_(bar).svg License: Public Domain Contributors: Tim Stellmach File:Standard deviation diagram.svg Source: http://en.wikipedia.org/w/index.php?title=File:Standard_deviation_diagram.svg License: Creative Commons Attribution 2.5 Contributors: Mwtoews File:Discrete probability distrib.svg Source: http://en.wikipedia.org/w/index.php?title=File:Discrete_probability_distrib.svg License: Public Domain Contributors: Oleg Alexandrov File:Discrete probability distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Discrete_probability_distribution.svg License: Public Domain Contributors: Incnis Mrsi (talk) File:Normal probability distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Normal_probability_distribution.svg License: Public Domain Contributors: Incnis Mrsi (talk) File:Mixed probability distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Mixed_probability_distribution.svg License: Public Domain Contributors: Incnis Mrsi (talk) File:Binomial distribution pmf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_distribution_pmf.svg License: Public Domain Contributors: Tayste File:Binomial distribution cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_distribution_cdf.svg License: Public Domain Contributors: Tayste File:Binomial Distribution.PNG Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_Distribution.PNG License: unknown Contributors: Schlurcher File:Pascal's triangle; binomial distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Pascal's_triangle;_binomial_distribution.svg License: Public Domain Contributors: Lipedia File:Binomial Distribution.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_Distribution.svg License: GNU Free Documentation License Contributors: cflm (talk) Image:DUniform distribution PDF.png Source: http://en.wikipedia.org/w/index.php?title=File:DUniform_distribution_PDF.png License: GNU Free Documentation License Contributors: EugeneZelenko, PAR, WikipediaMaster Image:Dis Uniform distribution CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Dis_Uniform_distribution_CDF.svg License: GNU General Public License Contributors: en:User:Pdbailey, traced by User:Stannered File:poisson pmf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Poisson_pmf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas File:poisson cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Poisson_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas File:Binomial versus poisson.svg Source: http://en.wikipedia.org/w/index.php?title=File:Binomial_versus_poisson.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Sergio01 Image:Beta-binomial distribution pmf.png Source: http://en.wikipedia.org/w/index.php?title=File:Beta-binomial_distribution_pmf.png 
License: Creative Commons Attribution-Sharealike 3.0 Contributors: Nschuma Image:Beta-binomial cdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Beta-binomial_cdf.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: Nschuma File:Negbinomial.gif Source: http://en.wikipedia.org/w/index.php?title=File:Negbinomial.gif License: Public Domain Contributors: Stpasha File:geometric pmf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_pmf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas File:geometric cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas File:2D-simplex.svg Source: http://en.wikipedia.org/w/index.php?title=File:2D-simplex.svg License: Public Domain Contributors: Tosha Image:Dirichlet distributions.png Source: http://en.wikipedia.org/w/index.php?title=File:Dirichlet_distributions.png License: Public domain Contributors: en:User:ThG Image:Dirichlet example.png Source: http://en.wikipedia.org/w/index.php?title=File:Dirichlet_example.png License: Public Domain Contributors: Mitch3 image:Uniform distribution PDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_PDF.png License: Public Domain Contributors: EugeneZelenko, It Is Me Here, Joxemai, PAR image:Uniform distribution CDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_CDF.png License: Public Domain Contributors: EugeneZelenko, Joxemai, PAR Image:Beta distribution pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Beta_distribution_pdf.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Krishnavedala Image:Beta distribution cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Beta_distribution_cdf.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Krishnavedala File:CDF for symmetric Beta distribution vs. x and alpha=beta - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:CDF_for_symmetric_Beta_distribution_vs._x_and_alpha=beta_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:CDF for skewed Beta distribution vs. x and beta= 5 alpha - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:CDF_for_skewed_Beta_distribution_vs._x_and_beta=_5_alpha_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Mode Beta Distribution for alpha and beta from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Mode_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Median Beta Distribution for alpha and beta from 0 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Median_Beta_Distribution_for_alpha_and_beta_from_0_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. 
Rodal File:(Mean - Median) for Beta distribution versus alpha and beta from 0 to 2 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:(Mean_-_Median)_for_Beta_distribution_versus_alpha_and_beta_from_0_to_2_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Relative Error for Approximation to Median of Beta Distribution for alpha and beta from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Relative_Error_for_Approximation_to_Median_of_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Error in Median Apprx. relative to Mean-Mode distance for Beta Distribution with alpha and beta from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Error_in_Median_Apprx._relative_to_Mean-Mode_distance_for_Beta_Distribution_with_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Mean Beta Distribution for alpha and beta from 0 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Mean_Beta_Distribution_for_alpha_and_beta_from_0_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:(Mean - GeometricMean) for Beta Distribution versus alpha and beta from 0 to 2 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:(Mean_-_GeometricMean)_for_Beta_Distribution_versus_alpha_and_beta_from_0_to_2_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Geometric Means for Beta distribution Purple=G(X), Yellow=G(1-X), smaller values alpha and beta in front - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_Means_for_Beta_distribution_Purple=G(X),_Yellow=G(1-X),_smaller_values_alpha_and_beta_in_front_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Geometric Means for Beta distribution Purple=G(X), Yellow=G(1-X), larger values alpha and beta in front - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Geometric_Means_for_Beta_distribution_Purple=G(X),_Yellow=G(1-X),_larger_values_alpha_and_beta_in_front_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Harmonic mean for Beta distribution for alpha and beta ranging from 0 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Harmonic_mean_for_Beta_distribution_for_alpha_and_beta_ranging_from_0_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal

File:(Mean - HarmonicMean) for Beta distribution versus alpha and beta from 0 to 2 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:(Mean_-_HarmonicMean)_for_Beta_distribution_versus_alpha_and_beta_from_0_to_2_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Harmonic Means for Beta distribution Purple=H(X), Yellow=H(1-X), smaller values alpha and beta in front - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Harmonic_Means_for_Beta_distribution_Purple=H(X),_Yellow=H(1-X),_smaller_values_alpha_and_beta_in_front_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Harmonic Means for Beta distribution Purple=H(X), Yellow=H(1-X), larger values alpha and beta in front - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Harmonic_Means_for_Beta_distribution_Purple=H(X),_Yellow=H(1-X),_larger_values_alpha_and_beta_in_front_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Variance for Beta Distribution for alpha and beta ranging from 0 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Variance_for_Beta_Distribution_for_alpha_and_beta_ranging_from_0_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Beta distribution log geometric variances front view - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Beta_distribution_log_geometric_variances_front_view_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Beta distribution log geometric variances back view - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Beta_distribution_log_geometric_variances_back_view_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Ratio of Mean Abs. Dev. to Std.Dev. Beta distribution with alpha and beta from 0 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Ratio_of_Mean_Abs._Dev._to_Std.Dev._Beta_distribution_with_alpha_and_beta_from_0_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Ratio of Mean Abs. Dev. to Std.Dev. Beta distribution vs. nu from 0 to 10 and vs. mean - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Ratio_of_Mean_Abs._Dev._to_Std.Dev._Beta_distribution_vs._nu_from_0_to_10_and_vs._mean_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Skewness for Beta Distribution as a function of the variance and the mean - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Skewness_for_Beta_Distribution_as_a_function_of_the_variance_and_the_mean_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Skewness Beta Distribution for alpha and beta from 1 to 5 - J. 
Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Skewness Beta Distribution for alpha and beta from .1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_alpha_and_beta_from_.1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Excess Kurtosis for Beta Distribution as a function of variance and mean - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_for_Beta_Distribution_as_a_function_of_variance_and_mean_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Excess Kurtosis for Beta Distribution with alpha and beta ranging from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_for_Beta_Distribution_with_alpha_and_beta_ranging_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Excess Kurtosis for Beta Distribution with alpha and beta ranging from 0.1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_for_Beta_Distribution_with_alpha_and_beta_ranging_from_0.1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Re(CharacteristicFunction) Beta Distr alpha=beta from 0 to 25 Back - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Re(CharacteristicFunction)_Beta_Distr_alpha=beta_from_0_to_25_Back_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Re(CharacteristicFunc) Beta Distr alpha=beta from 0 to 25 Front- J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Re(CharacteristicFunc)_Beta_Distr_alpha=beta_from_0_to_25_Front-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Re(CharacteristFunc) Beta Distr alpha from 0 to 25 and beta=alpha+0.5 Back - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Re(CharacteristFunc)_Beta_Distr_alpha_from_0_to_25_and_beta=alpha+0.5_Back_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Re(CharacterFunc) Beta Distrib. beta from 0 to 25, alpha=beta+0.5 Back - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Re(CharacterFunc)_Beta_Distrib._beta_from_0_to_25,_alpha=beta+0.5_Back_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Re(CharacterFunc) Beta Distr. beta from 0 to 25, alpha=beta+0.5 Front - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Re(CharacterFunc)_Beta_Distr._beta_from_0_to_25,_alpha=beta+0.5_Front_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. 
Rodal Image:Logit.png Source: http://en.wikipedia.org/w/index.php?title=File:Logit.png License: GNU Free Documentation License Contributors: Darapti, Maksim File:Differential Entropy Beta Distribution for alpha and beta from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Differential Entropy Beta Distribution for alpha and beta from 0.1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_for_alpha_and_beta_from_0.1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Mean Median Difference - Beta Distribution for alpha and beta from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Mean_Median_Difference_-_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Mean Mode Difference - Beta Distribution for alpha and beta from 1 to 5 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Mean_Mode_Difference_-_Beta_Distribution_for_alpha_and_beta_from_1_to_5_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Mean, Median, Geometric Mean and Harmonic Mean for Beta distribution with alpha = beta from 0 to 5 - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Mean,_Median,_Geometric_Mean_and_Harmonic_Mean_for_Beta_distribution_with_alpha_=_beta_from_0_to_5_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:(alpha and beta) Parameter estimates vs. excess Kurtosis and (squared) Skewness Beta distribution - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:(alpha_and_beta)_Parameter_estimates_vs._excess_Kurtosis_and_(squared)_Skewness_Beta_distribution_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Inflexion points Beta Distribution alpha and beta ranging from 0 to 5 large ptl view - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Inflexion_points_Beta_Distribution_alpha_and_beta_ranging_from_0_to_5_large_ptl_view_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Inflexion points Beta Distribution alpha and beta ranging from 0 to 5 large ptr view - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Inflexion_points_Beta_Distribution_alpha_and_beta_ranging_from_0_to_5_large_ptr_view_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal

File:PDF for symmetric beta distribution vs. x and alpha=beta from 0 to 30 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:PDF_for_symmetric_beta_distribution_vs._x_and_alpha=beta_from_0_to_30_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:PDF for symmetric beta distribution vs. x and alpha=beta from 0 to 2 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:PDF_for_symmetric_beta_distribution_vs._x_and_alpha=beta_from_0_to_2_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:PDF for skewed beta distribution vs. x and beta= 2.5 alpha from 0 to 9 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:PDF_for_skewed_beta_distribution_vs._x_and_beta=_2.5_alpha_from_0_to_9_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:PDF for skewed beta distribution vs. x and beta= 5.5 alpha from 0 to 9 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:PDF_for_skewed_beta_distribution_vs._x_and_beta=_5.5_alpha_from_0_to_9_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:PDF for skewed beta distribution vs. x and beta= 8 alpha from 0 to 10 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:PDF_for_skewed_beta_distribution_vs._x_and_beta=_8_alpha_from_0_to_10_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Max (Joint Log Likelihood per N) for Beta distribution Maxima at alpha=beta=2 - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Max_(Joint_Log_Likelihood_per_N)_for_Beta_distribution_Maxima_at_alpha=beta=2_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Max (Joint Log Likelihood per N) for Beta distribution Maxima at alpha=beta= 0.25,0.5,1,2,4,6,8 - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Max_(Joint_Log_Likelihood_per_N)_for_Beta_distribution_Maxima_at_alpha=beta=_0.25,0.5,1,2,4,6,8_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Fisher Information I(a,a) for alpha=beta vs range (c-a) and exponent alpha=beta - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Fisher_Information_I(a,a)_for_alpha=beta_vs_range_(c-a)_and_exponent_alpha=beta_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Fisher Information I(alpha,a) for alpha=beta, vs. range (c - a) and exponent alpha=beta - J. Rodal.png Source: http://en.wikipedia.org/w/index.php?title=File:Fisher_Information_I(alpha,a)_for_alpha=beta,_vs._range_(c_-_a)_and_exponent_alpha=beta_-_J._Rodal.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. 
Rodal File:Arcsin density.svg Source: http://en.wikipedia.org/w/index.php?title=File:Arcsin_density.svg License: Creative Commons Attribution-Share Alike Contributors: user:Arthena File:Random_Walk_example.svg Source: http://en.wikipedia.org/w/index.php?title=File:Random_Walk_example.svg License: GNU Free Documentation License Contributors: Morn (talk) File:Uniform_distribution_PDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Uniform_distribution_PDF.png License: Public Domain Contributors: EugeneZelenko, It Is Me Here, Joxemai, PAR File:Beta Distribution beta=alpha from 1.05 to 4.95 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Beta_Distribution_beta=alpha_from_1.05_to_4.95_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Beta Distribution for beta=6-alpha and alpha ranging from 1.05 to 3 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Beta_Distribution_for_beta=6-alpha_and_alpha_ranging_from_1.05_to_3_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Beta Distribution for alpha=beta=4 and (alpha=3--+Sqrt(2), beta=6-alpha) J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Beta_Distribution_for_alpha=beta=4_and_(alpha=3--+Sqrt(2),_beta=6-alpha)_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Dr. J. Rodal File:Mode Beta Distribution for both alpha and beta greater than 1 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Mode_Beta_Distribution_for_both_alpha_and_beta_greater_than_1_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Mode Beta Distribution for both alpha and beta greater than 1 - another view - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Mode_Beta_Distribution_for_both_alpha_and_beta_greater_than_1_-_another_view_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Skewness Beta Distribution for mean full range and variance between 0.05 and 0.25 - Dr. J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_mean_full_range_and_variance_between_0.05_and_0.25_-_Dr._J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Skewness Beta Distribution for mean and variance both full range - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Skewness_Beta_Distribution_for_mean_and_variance_both_full_range_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Excess Kurtosis Beta Distribution with mean for full range and variance from 0.05 to 0.25 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_Beta_Distribution_with_mean_for_full_range_and_variance_from_0.05_to_0.25_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Excess Kurtosis Beta Distribution with mean and variance for full range - J. 
Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Excess_Kurtosis_Beta_Distribution_with_mean_and_variance_for_full_range_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Differential Entropy Beta Distribution with mean from 0.2 to 0.8 and variance from 0.01 to 0.09 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_with_mean_from_0.2_to_0.8_and_variance_from_0.01_to_0.09_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Differential Entropy Beta Distribution with mean from 0.3 to 0.7 and variance from 0 to 0.2 - J. Rodal.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Differential_Entropy_Beta_Distribution_with_mean_from_0.3_to_0.7_and_variance_from_0_to_0.2_-_J._Rodal.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Dr. J. Rodal File:Karl Pearson 2.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Karl_Pearson_2.jpg License: Public Domain Contributors: User:Struthious Bandersnatch File:Normal Distribution PDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_PDF.svg License: Public Domain Contributors: Inductiveload File:Normal Distribution CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Normal_Distribution_CDF.svg License: Public Domain Contributors: Inductiveload File:standard deviation diagram.svg Source: http://en.wikipedia.org/w/index.php?title=File:Standard_deviation_diagram.svg License: Creative Commons Attribution 2.5 Contributors: Mwtoews File:De moivre-laplace.gif Source: http://en.wikipedia.org/w/index.php?title=File:De_moivre-laplace.gif License: Public Domain Contributors: Stpasha File:Dice sum central limit theorem.svg Source: http://en.wikipedia.org/w/index.php?title=File:Dice_sum_central_limit_theorem.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Cmglee File:QHarmonicOscillator.png Source: http://en.wikipedia.org/w/index.php?title=File:QHarmonicOscillator.png License: GNU Free Documentation License Contributors: en:User:FlorianMarquardt File:Fisher iris versicolor sepalwidth.svg Source: http://en.wikipedia.org/w/index.php?title=File:Fisher_iris_versicolor_sepalwidth.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: en:User:Qwfp (original); Pbroks13 (talk) (redraw) File:FitNormDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitNormDistr.tif License: Public Domain Contributors: Buenas das File:Planche de Galton.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Planche_de_Galton.jpg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Antoine Taveneaux File:Carl Friedrich Gauss.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Carl_Friedrich_Gauss.jpg License: Public Domain Contributors: Gottlieb BiermannA. 
Wittmann (photo) File:Pierre-Simon Laplace.jpg Source: http://en.wikipedia.org/w/index.php?title=File:Pierre-Simon_Laplace.jpg License: Public Domain Contributors: Ashill, Ecummenic, Elcobbola, Gene.arboit, Jimmy44, Olivier2, , 1 anonymous edits

Image:student t pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Student_t_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas Image:student t cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Student_t_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas Image:T distribution 1df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_1df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim, 1 anonymous edits Image:T distribution 2df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_2df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim, 1 anonymous edits Image:T distribution 3df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_3df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim, 1 anonymous edits Image:T distribution 5df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_5df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim, 1 anonymous edits Image:T distribution 10df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_10df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim, 1 anonymous edits Image:T distribution 30df.png Source: http://en.wikipedia.org/w/index.php?title=File:T_distribution_30df.png License: GNU Free Documentation License Contributors: Juiced lemon, Maksim, 1 anonymous edits Image:Gamma distribution pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_distribution_pdf.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: Gamma_distribution_pdf.png: MarkSweep and Cburnett derivative work: Autopilot (talk) Image:Gamma distribution cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Gamma_distribution_cdf.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: Gamma_distribution_cdf.png: MarkSweep and Cburnett derivative work: Autopilot (talk) Image:Gamma-PDF-3D.png Source: http://en.wikipedia.org/w/index.php?title=File:Gamma-PDF-3D.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Ronhjones Image:Gamma-KL-3D.png Source: http://en.wikipedia.org/w/index.php?title=File:Gamma-KL-3D.png License: Creative Commons Attribution-Sharealike 3.0 Contributors: User:Ronhjones File:Pareto distributionPDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Pareto_distributionPDF.png License: Public Domain Contributors: EugeneZelenko, G.dallorto, It Is Me Here, Juiced lemon, PAR, 1 anonymous edits File:Pareto distributionCDF.png Source: http://en.wikipedia.org/w/index.php?title=File:Pareto_distributionCDF.png License: Public Domain Contributors: EugeneZelenko, G.dallorto, Juiced lemon, PAR, 2 anonymous edits File:FitParetoDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitParetoDistr.tif License: Creative Commons Attribution-Sharealike 3.0 Contributors: Buenas das File:Pareto 
distributionLorenz.png Source: http://en.wikipedia.org/w/index.php?title=File:Pareto_distributionLorenz.png License: GNU Free Documentation License Contributors: G.dallorto, Grafite, Juiced lemon, Magister Mathematicae, PAR, 1 anonymous edits Image:Inverse gamma pdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Inverse_gamma_pdf.png License: GNU General Public License Contributors: Alejo2083, Cburnett Image:Inverse gamma cdf.png Source: http://en.wikipedia.org/w/index.php?title=File:Inverse_gamma_cdf.png License: GNU General Public License Contributors: Alejo2083, Cburnett File:chi-square pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Geek3 File:chi-square distributionCDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Chi-square_distributionCDF.svg License: Creative Commons Zero Contributors: Philten, 2 anonymous edits Image:F distributionPDF.png Source: http://en.wikipedia.org/w/index.php?title=File:F_distributionPDF.png License: GNU Free Documentation License Contributors: en:User:Pdbailey Image:F distributionCDF.png Source: http://en.wikipedia.org/w/index.php?title=File:F_distributionCDF.png License: GNU Free Documentation License Contributors: en:User:Pdbailey Image:Some log-normal distributions.svg Source: http://en.wikipedia.org/w/index.php?title=File:Some_log-normal_distributions.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: original by User:Par Derivative by Mikael Hggstrm from the original, File:Lognormal distribution PDF.png by User:Par Image:Lognormal distribution CDF.svg Source: http://en.wikipedia.org/w/index.php?title=File:Lognormal_distribution_CDF.svg License: Creative Commons Attribution-ShareAlike 3.0 Unported Contributors: Lognormal_distribution_CDF.png: User:PAR derivative work: Autopilot (talk) Image:Comparison mean median mode.svg Source: http://en.wikipedia.org/w/index.php?title=File:Comparison_mean_median_mode.svg License: Creative Commons Attribution-Sharealike 3.0 Contributors: Cmglee File:FitLogNormDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitLogNormDistr.tif License: Creative Commons Attribution-Sharealike 3.0 Contributors: Buenas das File:exponential pdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Exponential_pdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas File:exponential cdf.svg Source: http://en.wikipedia.org/w/index.php?title=File:Exponential_cdf.svg License: Creative Commons Attribution 3.0 Contributors: Skbkekas File:Mean exp.svg Source: http://en.wikipedia.org/w/index.php?title=File:Mean_exp.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Erzbischof File:Median exp.svg Source: http://en.wikipedia.org/w/index.php?title=File:Median_exp.svg License: Creative Commons Attribution-Sharealike 3.0,2.5,2.0,1.0 Contributors: Erzbischof File:FitExponDistr.tif Source: http://en.wikipedia.org/w/index.php?title=File:FitExponDistr.tif License: 
Creative Commons Attribution-Sharealike 3.0 Contributors: Buenas das Image:GaussianScatterPCA.png Source: http://en.wikipedia.org/w/index.php?title=File:GaussianScatterPCA.png License: GNU Free Documentation License Contributors: Ben FrantzDale (talk) (Transferred by ILCyborg)

License

Creative Commons Attribution-Share Alike 3.0 Unported (http://creativecommons.org/licenses/by-sa/3.0/)
