
An Inequality and Associated Maximization Technique in Statistical Estimation for Probabilistic Functions of Markov Processes

LEONARD E. BAUM

Institute for Defense Analyses, Princeton, New Jersey

We say that $\{Y_t\}$ is a probabilistic function of the Markov process $\{X_t\}$ if

$$P(X_{t+1} = j \mid X_t = i, X_{t-1}, \ldots, Y_t, \ldots) = a_{ij}, \qquad i, j = 1, \ldots, s;$$
$$P(Y_{t+1} = k \mid X_{t+1} = j, X_t = i, X_{t-1}, \ldots, Y_t, Y_{t-1}, \ldots) = b_{ij}(k), \qquad i, j = 1, \ldots, s, \quad k = 1, \ldots, r.$$

We assume that $\{a_{ij}\}$, $\{b_{ij}(k)\}$ are unknown and restricted to lie in the manifold $M$:

$$a_{ij} \geq 0, \qquad \sum_{j=1}^{s} a_{ij} = 1, \qquad i = 1, \ldots, s,$$
$$b_{ij}(k) \geq 0, \qquad \sum_{k=1}^{r} b_{ij}(k) = 1, \qquad i, j = 1, \ldots, s.$$

We see a $Y$ sample $\{Y_1 = y_1, Y_2 = y_2, \ldots, Y_T = y_T\}$ but not an $X$ sample, and desire to estimate $\{a_{ij}, b_{ij}(k)\}$.
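For concreteness, a minimal simulation sketch of this setup, assuming NumPy (the function name `sample_path` and the fixed seed are illustrative, not from the paper): it draws a hidden state sequence $\{x_t\}$ and an output sequence $\{y_t\}$, of which only $\{y_t\}$ is available to the estimator.

```python
import numpy as np

def sample_path(a0, A, B, T, rng=None):
    """Draw (x_0..x_T, y_1..y_T) from a probabilistic function of a Markov chain.

    a0 : (s,)      initial probabilities a_i
    A  : (s, s)    transition probabilities a_ij (rows sum to 1)
    B  : (s, s, r) output probabilities b_ij(k) (B[i, j] sums to 1 over k)
    """
    rng = rng or np.random.default_rng(0)
    s = len(a0)
    x = [rng.choice(s, p=a0)]                   # hidden states, never observed
    y = []                                      # observed outputs
    for _ in range(T):
        i = x[-1]
        j = rng.choice(s, p=A[i])               # X_{t+1} = j with probability a_ij
        k = rng.choice(B.shape[2], p=B[i, j])   # Y_{t+1} = k with probability b_ij(k)
        x.append(j)
        y.append(k)
    return x, y
```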

We would like to choose maximum likelihood parameter values, i.e., $\{a_{ij}, b_{ij}(k)\}$ which maximize the probability of the observed sample $\{y_t\}$:

$$P_{\{y_t\}}(\{a_{ij}, b_{ij}(k)\}) = P(\{a_{ij}, b_{ij}(k)\}) = \sum_{i_0, i_1, \ldots, i_T = 1}^{s} a_{i_0}\, a_{i_0 i_1} b_{i_0 i_1}(y_1)\, a_{i_1 i_2} b_{i_1 i_2}(y_2) \cdots a_{i_{T-1} i_T} b_{i_{T-1} i_T}(y_T), \tag{1}$$

where the $a_i$ are initial probabilities for the Markov process.
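As a sanity check, (1) can be evaluated by explicit summation over all $s^{T+1}$ state sequences. A brute-force sketch (my own naming; feasible only for tiny $s$ and $T$, which is exactly what motivates the recursions introduced below):

```python
from itertools import product

def likelihood_bruteforce(a0, A, B, y):
    """Evaluate Eq. (1) by summing over all s**(T+1) state sequences."""
    s, T = len(a0), len(y)
    total = 0.0
    for path in product(range(s), repeat=T + 1):    # all (i_0, ..., i_T)
        p = a0[path[0]]
        for t in range(T):                          # factor a_{i_t i_{t+1}} b_{i_t i_{t+1}}(y_{t+1})
            i, j = path[t], path[t + 1]
            p *= A[i][j] * B[i][j][y[t]]            # y[t] encodes y_{t+1}, 0-based
        total += p
    return total
```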

For this purpose we define a transformation $\tau\{a_{ij}, b_{ij}(k)\} = \{\bar{a}_{ij}, \bar{b}_{ij}(k)\}$ of $M$ into itself:

$$\bar{a}_{ij} = \frac{\sum_t P(X_t = i, X_{t+1} = j \mid \{y_t\}, \{a_{ij}, b_{ij}(k)\})}{\sum_t P(X_t = i \mid \{y_t\}, \{a_{ij}, b_{ij}(k)\})} = \frac{\sum_t \alpha_t(i)\, \beta_{t+1}(j)\, a_{ij} b_{ij}(y_{t+1})}{\sum_t \alpha_t(i)\, \beta_t(i)} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_j a_{ij}\, \partial P/\partial a_{ij}}, \tag{2a}$$

$$\bar{b}_{ij}(k) = \frac{\sum_{t:\, y_{t+1} = k} P(X_t = i, X_{t+1} = j \mid \{y_t\}, \{a_{ij}, b_{ij}(k)\})}{\sum_t P(X_t = i, X_{t+1} = j \mid \{y_t\}, \{a_{ij}, b_{ij}(k)\})} = \frac{\sum_{t:\, y_{t+1} = k} \alpha_t(i)\, \beta_{t+1}(j)\, a_{ij} b_{ij}(y_{t+1})}{\sum_t \alpha_t(i)\, \beta_{t+1}(j)\, a_{ij} b_{ij}(y_{t+1})} = \frac{b_{ij}(k)\, \partial P/\partial b_{ij}(k)}{\sum_k b_{ij}(k)\, \partial P/\partial b_{ij}(k)}. \tag{2b}$$

The second of the equivalent forms in Eqs. (2) contains quantities $\alpha_t(i)$, $\beta_t(j)$, which are defined inductively forwards and backwards, respectively, in $t$ by

$$\alpha_{t+1}(j) = \sum_{i=1}^{s} \alpha_t(i)\, a_{ij} b_{ij}(y_{t+1}), \qquad j = 1, \ldots, s, \quad t = 0, 1, \ldots, T - 1,$$
$$\beta_t(i) = \sum_{j=1}^{s} \beta_{t+1}(j)\, a_{ij} b_{ij}(y_{t+1}), \qquad i = 1, \ldots, s, \quad t = T - 1, T - 2, \ldots, 0. \tag{3}$$

Note that the $\alpha_t(i)$, $\beta_t(i)$, $i = 1, \ldots, s$, $t = 0, \ldots, T$, can all be computed with $4s^2T$ multiplications. Hence

$$P(\{a_{ij}, b_{ij}(k)\}) = \sum_{i=1}^{s} \alpha_t(i)\, \beta_t(i)$$

(identically in $t$) can be computed with $4s^2T$ multiplications rather than the $2Ts^{T+1}$ multiplications indicated in the defining formula (1). Similarly, the partial derivatives of $P$ needed for defining the image in (2) are computed from the $\alpha$'s and $\beta$'s with a work factor linear in $T$, not exponential in $T$.

There are three ways of rationalizing the use of this transformation, defined in (2):

(a) Bayesian a posteriori reestimation suggested the transformation $\tau$ originally and is embodied in the first expressions for $\bar{a}_{ij}$ and $\bar{b}_{ij}(k)$.

(b) An attempt to solve the likelihood equations, obtained by setting the partial derivatives of $P$ with respect to the $a_{ij}$ and $b_{ij}(k)$ to zero while taking due account of the restraints, is indicated in the third expressions for $\bar{a}_{ij}$ and $\bar{b}_{ij}(k)$, since the likelihood equations can be put into the form

$$a_{ij} = \frac{a_{ij}\, \partial P/\partial a_{ij}}{\sum_j a_{ij}\, \partial P/\partial a_{ij}}, \qquad b_{ij}(k) = \frac{b_{ij}(k)\, \partial P/\partial b_{ij}(k)}{\sum_k b_{ij}(k)\, \partial P/\partial b_{ij}(k)}.$$

THEOREM 1. [1] $P(\tau\{a_{ij}, b_{ij}(k)\}) > P(\{a_{ij}, b_{ij}(k)\})$ unless $\tau\{a_{ij}, b_{ij}(k)\} = \{a_{ij}, b_{ij}(k)\}$, which is true if and only if $\{a_{ij}, b_{ij}(k)\}$ is a critical point of $P$, i.e., a solution of the likelihood equations.

Note that $\tau$ depends only on the first derivatives of $P$. Now if one moves a sufficiently small distance in the gradient direction, one is guaranteed to increase $P$, but how small a distance depends on the second partials. It is somewhat unexpected to find that it is possible to specify a point at which $P$ increases without any mention of higher derivatives.

Eagon and the author [1] originally observed that $P(\{a_{ij}, b_{ij}(k)\})$ is a homogeneous polynomial of degree $2T + 1$ in the $a_i$, $a_{ij}$, $b_{ij}(k)$ and obtained the result as an application of the following theorem.

THEOREM 2. [1] Let

$$P(z_1, \ldots, z_n) = \sum_{\mu_1, \mu_2, \ldots, \mu_n} c_{\mu_1, \mu_2, \ldots, \mu_n}\, z_1^{\mu_1} z_2^{\mu_2} \cdots z_n^{\mu_n}, \qquad \text{where } c_{\mu_1, \mu_2, \ldots, \mu_n} \geq 0$$

and $\mu_1 + \cdots + \mu_n = d$. Then

$$\tau: \{z_i\} \to \left\{ \frac{z_i\, \partial P/\partial z_i}{\sum_i z_i\, \partial P/\partial z_i} \right\}$$

maps $D: z_i \geq 0$, $\sum_i z_i = 1$ into itself and satisfies $P(\tau\{z_i\}) \geq P(\{z_i\})$. In fact, strict inequality holds unless $\{z_i\}$ is a critical point of $P$ in $D$.

For the proof, the partial derivatives were evaluated as

$$z_i\, \partial P/\partial z_i = \sum_{\mu_1, \mu_2, \ldots, \mu_n} c_{\mu_1, \mu_2, \ldots, \mu_n}\, \mu_i\, z_1^{\mu_1} z_2^{\mu_2} \cdots z_n^{\mu_n}$$

and substituted for the variables $z_i$ in the expression for $P$. An elementary though very tricky juggling of the inequality between geometric and arithmetic means and Hölder's inequality then led to the desired result through a
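The recursions (3) and the reestimation maps (2a)-(2b) translate directly into code. A sketch under stated assumptions (NumPy arrays, 0-based observation codes, my own function names, and no rescaling against numerical underflow for long $T$):

```python
import numpy as np

def forward_backward(a0, A, B, y):
    """Compute the alpha, beta of Eq. (3) and the likelihood P of Eq. (1)."""
    s, T = len(a0), len(y)
    alpha = np.zeros((T + 1, s))
    beta = np.ones((T + 1, s))
    alpha[0] = a0                                   # alpha_0(i) = a_i
    for t in range(T):                              # forwards
        alpha[t + 1] = alpha[t] @ (A * B[:, :, y[t]])
    for t in range(T - 1, -1, -1):                  # backwards
        beta[t] = (A * B[:, :, y[t]]) @ beta[t + 1]
    P = float(alpha[0] @ beta[0])                   # = sum_i alpha_t(i) beta_t(i), any t
    return alpha, beta, P

def reestimate(a0, A, B, y):
    """One application of tau, i.e., Eqs. (2a) and (2b); a0 is held fixed here."""
    s, r, T = A.shape[0], B.shape[2], len(y)
    alpha, beta, P = forward_backward(a0, A, B, y)
    # xi[t, i, j] is proportional to P(X_t = i, X_{t+1} = j | {y_t}):
    #   alpha_t(i) a_ij b_ij(y_{t+1}) beta_{t+1}(j)
    xi = np.array([alpha[t][:, None] * A * B[:, :, y[t]] * beta[t + 1][None, :]
                   for t in range(T)])
    gamma = alpha * beta                            # proportional to P(X_t = i | {y_t})
    A_new = xi.sum(axis=0) / gamma[:T].sum(axis=0)[:, None]     # Eq. (2a)
    num = np.zeros((s, s, r))
    for t in range(T):
        num[:, :, y[t]] += xi[t]                    # restrict to t with y_{t+1} = k
    B_new = num / xi.sum(axis=0)[:, :, None]        # Eq. (2b); assumes positive denominators
    return A_new, B_new, P
```

Comparing `P` here against `likelihood_bruteforce` above, and checking that `P` never decreases across repeated calls to `reestimate`, exercises the $4s^2T$ cost claim and Theorem 1 numerically.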

route which cast no light on what was actually happening. The author believes the following derivation, due to Baum et al. [2], which greatly generalizes the applicability of the transformation $\tau$, lays bare the essence of the situation.

We adopt a simplified notation. We write

$$P(\lambda) = \sum_{x \in X} p(x, \lambda)$$

where $\lambda$ specifies an $[s - 1 + s(s - 1) + s^2(r - 1)]$-dimensional parameter point $\{a_i, a_{ij}, b_{ij}(k)\}$ in $[s + s^2 + s^2 r]$-dimensional space, and $x = \{x_0, x_1, \ldots, x_T\}$ is a sequence of states of the unseen Markov process. The summation is over $X$, the space of all possible sequences of states of length $T + 1$, and

$$p(x, \lambda) = a_{x_0}\, a_{x_0 x_1} b_{x_0 x_1}(y_1) \cdots a_{x_{T-1} x_T} b_{x_{T-1} x_T}(y_T)$$

is the probability of the Markov process following that sequence of states and producing the observed $\{y_t\}$ sample for the parameter values $\{a_i, a_{ij}, b_{ij}(k)\}$. More generally, we write

$$P(\lambda) = \int_{x \in X} p(x, \lambda)\, d\mu(x)$$

where $\mu$ is a finite nonnegative measure and $p(x, \lambda)$ is positive a.e. with respect to $\mu$. In the main application of interest $\mu$ is a counting measure: $\mu(x) = 1$ for each of the $s^{T+1}$ points $x$.

We wish to define a transformation $\tau$ on the $\lambda$-space and show that $P(\tau(\lambda)) \geq P(\lambda)$. For this purpose we define an auxiliary function of two variables

$$Q(\lambda, \lambda') = \int_{x \in X} p(x, \lambda) \log p(x, \lambda')\, d\mu(x).$$

THEOREM 3. [2] If $Q(\lambda, \bar{\lambda}) \geq Q(\lambda, \lambda)$, then $P(\bar{\lambda}) > P(\lambda)$ unless $p(x, \bar{\lambda}) = p(x, \lambda)$ a.e. with respect to $\mu$.

Proof. We shall apply Jensen's inequality to the concave function $\log x$. We wish to prove $P(\bar{\lambda}) \geq P(\lambda)$ or, equivalently, $\log[P(\bar{\lambda})/P(\lambda)] \geq 0$. Now

$$\log \frac{P(\bar{\lambda})}{P(\lambda)} = \log \left[ \frac{1}{P(\lambda)} \int_X p(x, \bar{\lambda})\, d\mu(x) \right] = \log \int_X \frac{p(x, \bar{\lambda})}{p(x, \lambda)} \cdot \frac{p(x, \lambda)}{P(\lambda)}\, d\mu(x)$$
$$\geq \int_X \frac{p(x, \lambda)}{P(\lambda)} \log \frac{p(x, \bar{\lambda})}{p(x, \lambda)}\, d\mu(x) = \frac{1}{P(\lambda)} \left[ Q(\lambda, \bar{\lambda}) - Q(\lambda, \lambda) \right] \geq 0$$

by hypothesis. Jensen's inequality is applicable to the first inequality since $p(x, \lambda)\, d\mu(x)/P(\lambda)$ is a nonnegative measure with total mass 1. Since $\log$ is strictly concave ($\log'' < 0$), equality can hold only if $p(x, \bar{\lambda})/p(x, \lambda)$ is constant a.e. with respect to $d\mu(x)$.

We now have a way of increasing $P(\lambda)$. For each $\lambda$ we need only find a $\bar{\lambda}$ with $Q(\lambda, \bar{\lambda}) \geq Q(\lambda, \lambda)$. This may not seem any easier than directly finding a $\bar{\lambda}$ with $P(\bar{\lambda}) \geq P(\lambda)$. However, the author shall show that under natural assumptions, and in particular in the cases of interest:

(a) For fixed $\lambda$, $Q(\lambda, \lambda')$ assumes its global maximum as a function of $\lambda'$ at a unique point $\tau(\lambda)$.

(b) $\tau(\lambda)$ is continuous.

(c) $\tau(\lambda)$ is effectively computable.

(d) $P(\tau(\lambda)) \geq P(\lambda)$, which follows from Theorem 3 and the definition of $\tau(\lambda)$, since $\lambda' = \lambda$ is one of the competitors for the global maximum of $Q(\lambda, \lambda')$ as a function of $\lambda'$.

We apply Theorem 3 to the principal case of interest. Letting $\{a, A, B\}$ denote $\{a_i, a_{ij}, b_{ij}(k)\}$, we have

$$P(a, A, B) = \sum_x p(x, a, A, B)$$

where

$$p(x, a, A, B) = a_{x_0} \prod_{t=0}^{T-1} a_{x_t x_{t+1}} \prod_{t=0}^{T-1} b_{x_t x_{t+1}}(y_{t+1}).$$

Also

$$Q(a, A, B;\ a', A', B') = \sum_x p(x, a, A, B) \left[ \log a'_{x_0} + \sum_t \log a'_{x_t x_{t+1}} + \sum_t \log b'_{x_t x_{t+1}}(y_{t+1}) \right].$$

For fixed $a, A, B$ we seek to maximize $Q$ as a function of $a', A', B'$. We observe that for $a, A, B$ fixed, $Q$ is a sum of three functions, one involving only $\{a_i'\}$, the second involving only $\{a_{ij}'\}$, and the third involving only $\{b_{ij}'(k)\}$, which can be maximized separately.

We consider the second of these. Observe that

$$\sum_{x \in X} p(x, a, A, B) \sum_t \log a'_{x_t x_{t+1}} = \sum_{i=1}^{s} \left[ \sum_{x \in X} p(x, a, A, B) \sum_{t:\, x_t = i} \log a'_{x_t x_{t+1}} \right]$$

is itself a sum of $s$ functions, the $i$th of which involves only the $a'_{ij}$, $j = 1, \ldots, s$, which can be maximized separately.
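To make Theorem 3 and the decomposition of $Q$ concrete, a brute-force sketch (my own; it enumerates all $s^{T+1}$ paths, so it is only a check for tiny problems, and it assumes $p(x, \lambda') > 0$ wherever $p(x, \lambda) > 0$):

```python
import math
from itertools import product

def path_prob(a0, A, B, y, x):
    """p(x, lambda) for one state sequence x = (x_0, ..., x_T)."""
    p = a0[x[0]]
    for t in range(len(y)):
        p *= A[x[t]][x[t + 1]] * B[x[t]][x[t + 1]][y[t]]
    return p

def Q(lam, lam2, y):
    """Q(lambda, lambda') = sum_x p(x, lambda) log p(x, lambda')."""
    a0, A, B = lam
    total = 0.0
    for x in product(range(len(a0)), repeat=len(y) + 1):
        p = path_prob(a0, A, B, y, x)
        if p > 0.0:                     # 0 * log(...) terms contribute nothing
            total += p * math.log(path_prob(*lam2, y, x))
    return total
```

With this one can check numerically that any `lam2` with `Q(lam, lam2, y) >= Q(lam, lam, y)` also satisfies $P(\lambda') \geq P(\lambda)$, and that the point produced by the `reestimate` sketch above maximizes `Q(lam, ., y)`.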
If we let $n_{ij}(x)$ be the number of $t$'s with $x_t = i$, $x_{t+1} = j$ in the sequence of states specified by $x$, we can write the $i$th function as

$$\sum_{x \in X} \sum_{j=1}^{s} n_{ij}(x)\, p(x, a, A, B) \log a'_{ij} = \sum_{j=1}^{s} A_{ij} \log a'_{ij}$$

where $A_{ij} = \sum_{x \in X} n_{ij}(x)\, p(x, a, A, B)$. But

$$\sum_{j=1}^{s} A_{ij} \log a'_{ij},$$

as a function of $\{a'_{ij}\}$ subject to the restraints

$$\sum_{j=1}^{s} a'_{ij} = 1, \qquad a'_{ij} \geq 0,$$

attains a global maximum at the single point

$$\bar{a}_{ij} = A_{ij} \Big/ \sum_{j=1}^{s} A_{ij}.$$

This $\{\bar{a}_{ij}\}$ agrees with the first expression of (2); i.e.,

$$\sum_{t=0}^{T-1} P(X_t = i, X_{t+1} = j \mid \{y_t\}, \{a_{ij}, b_{ij}(k)\}) = A_{ij}/P(\{y_t\} \mid \{a_{ij}, b_{ij}(k)\}).$$

Similarly we obtain

$$\bar{a}_i = \sum_{x:\, x_0 = i} p(x, a, A, B) \Big/ \sum_x p(x, a, A, B),$$

$$\bar{b}_{ij}(k) = \sum_x \sum_{\substack{t:\, x_t = i,\ x_{t+1} = j \\ y_{t+1} = k}} p(x, a, A, B) \Big/ \sum_x \sum_{t:\, x_t = i,\ x_{t+1} = j} p(x, a, A, B),$$

in agreement with (2). Of course $\bar{a}_i$, $\bar{a}_{ij}$, $\bar{b}_{ij}(k)$ are computed by inductive calculations as indicated in the second expression of (2) and in (3), not as in the above formulas.

We have now shown that the transformation $\tau$ increases $P$ in the case where the output observables $Y$ take values in a finite state space.

We can also consider the case [2] where the output observables $Y_t$ are real-valued. For example, imagine that

$$P(Y_t = y \mid X_t = i) = \frac{1}{(2\pi)^{1/2} \sigma_i} \exp\left( -\frac{(y - m_i)^2}{2\sigma_i^2} \right) = b(m_i, \sigma_i, y);$$

i.e., associated with state $i$ of an unseen Markov process there is a normally distributed variable with an unknown mean $m_i$ and standard deviation $\sigma_i$. Now we wish to maximize the likelihood density of an observation $y_1, \ldots, y_T$,

$$P(a, A, m, \sigma) = \sum_x p(a, A, m, \sigma, x)$$

where

$$p(a, A, m, \sigma, x) = a_{x_0}\, a_{x_0 x_1} b(m_{x_1}, \sigma_{x_1}, y_1) \cdots a_{x_{T-1} x_T} b(m_{x_T}, \sigma_{x_T}, y_T).$$

With

$$Q(a, A, m, \sigma;\ a', A', m', \sigma') = \sum_{x \in X} p(x, a, A, m, \sigma) \log p(x, a', A', m', \sigma'),$$

Theorem 3 applies since everything is nonnegative; it is sufficient to find $\bar{a}, \bar{A}, \bar{m}, \bar{\sigma}$ such that

$$Q(a, A, m, \sigma;\ \bar{a}, \bar{A}, \bar{m}, \bar{\sigma}) \geq Q(a, A, m, \sigma;\ a, A, m, \sigma).$$

An argument similar to one given previously shows that:

THEOREM 4. [2] For each fixed $\{a, A, m, \sigma\}$, the function $Q(a, A, m, \sigma;\ a', A', m', \sigma')$ attains a global maximum at a unique point. This point $\tau(a, A, m, \sigma)$, the transform of $\{a, A, m, \sigma\}$, is given by

$$\bar{a}_{ij} = \frac{\sum_t \alpha_t(i)\, a_{ij}\, \beta_{t+1}(j)\, b(m_j, \sigma_j, y_{t+1})}{\sum_{j=1}^{s} \sum_t \alpha_t(i)\, a_{ij}\, \beta_{t+1}(j)\, b(m_j, \sigma_j, y_{t+1})},$$

$$\bar{m}_j = \frac{\sum_t \alpha_t(j)\, \beta_t(j)\, y_t}{\sum_t \alpha_t(j)\, \beta_t(j)}, \qquad \bar{\sigma}_j^2 = \frac{\sum_t \alpha_t(j)\, \beta_t(j)\, (y_t - \bar{m}_j)^2}{\sum_t \alpha_t(j)\, \beta_t(j)}.$$

The last two can be interpreted, respectively, as a posteriori means and variances.
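A sketch of the Theorem 4 reestimation for normal outputs, under the same caveats as before (my own names, the emission attached to the state entered at each step, $a$ held fixed, and no underflow protection):

```python
import numpy as np

def normal_density(m, sig, y):
    """b(m, sigma, y), the normal density defined above."""
    return np.exp(-(y - m) ** 2 / (2.0 * sig ** 2)) / (np.sqrt(2.0 * np.pi) * sig)

def reestimate_normal(a0, A, m, sig, y):
    """One step of tau for normally distributed outputs."""
    y = np.asarray(y, dtype=float)
    s, T = len(a0), len(y)
    emit = normal_density(m[None, :], sig[None, :], y[:, None])  # emit[t, j] = b(m_j, sig_j, y_{t+1})
    alpha = np.zeros((T + 1, s)); alpha[0] = a0
    beta = np.ones((T + 1, s))
    for t in range(T):                  # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij b(m_j, sig_j, y_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * emit[t]
    for t in range(T - 1, -1, -1):      # beta_t(i) = sum_j a_ij b(m_j, sig_j, y_{t+1}) beta_{t+1}(j)
        beta[t] = A @ (emit[t] * beta[t + 1])
    # xi[t, i, j] is proportional to P(X_t = i, X_{t+1} = j | y)
    xi = np.array([alpha[t][:, None] * A * (emit[t] * beta[t + 1])[None, :]
                   for t in range(T)])
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]        # bar a_ij of Theorem 4
    gamma = alpha[1:] * beta[1:]        # gamma[t-1, j] proportional to P(X_t = j | y), t = 1..T
    m_new = (gamma * y[:, None]).sum(axis=0) / gamma.sum(axis=0)                   # a posteriori means
    var_new = (gamma * (y[:, None] - m_new) ** 2).sum(axis=0) / gamma.sum(axis=0)  # a posteriori variances
    return A_new, m_new, np.sqrt(var_new)
```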
More generally, let $b(y)$ be a strictly log concave density, i.e., $(\log b)'' < 0$. We introduce a two-parameter family involving location and scale parameters $m_i$, $\sigma_i$ in state $i$ by defining $b(m, \sigma, y) = b((y - m)/\sigma)$ as we did for the normal density above. The following theorem is somewhat harder to prove than the previous results for the discrete and normal output variables:

THEOREM 5. [2] For fixed $a, A, m, \sigma$ the function $Q(a, A, m, \sigma;\ a', A', m', \sigma')$ attains a global maximum at a single point $(\bar{a}, \bar{A}, \bar{m}, \bar{\sigma})$. The transformation $\tau(a, A, m, \sigma) = (\bar{a}, \bar{A}, \bar{m}, \bar{\sigma})$ thus defined is continuous, and $P(\tau(a, A, m, \sigma)) \geq P(a, A, m, \sigma)$ with equality if and only if $\tau(a, A, m, \sigma) = (a, A, m, \sigma)$, which, in turn, holds if and only if $(a, A, m, \sigma)$ is a critical point of $P$.

However, the new $\bar{m}_i$, $\bar{\sigma}_i$ do not have obvious probabilistic interpretations as in the normal case above. Moreover, these $\bar{m}_i$ and $\bar{\sigma}_i$ cannot be inductively computed as in the finite and normal output cases. These facts greatly decrease the interest in this last transformation $\tau$.
We now consider convergence properties of the iterates of the transformation $\tau$. We have $P(\tau(\lambda)) \geq P(\lambda)$, equality holding if and only if $\tau(\lambda) = \lambda$, which holds if and only if $\lambda$ is a critical point of $P$. It follows that if $\lambda_0$ is a limit point of the sequence $\tau^n(\lambda)$, then $\tau(\lambda_0) = \lambda_0$. [In fact, if $\tau^{n_i}(\lambda) \to \lambda_0$, then $P(\lambda_0) \leq P(\tau(\lambda_0)) = \lim_i P(\tau^{n_i + 1}(\lambda)) \leq \lim_i P(\tau^{n_{i+1}}(\lambda)) = P(\lambda_0)$.] We want to conclude that $\tau^n(\lambda) \to \lambda_0$. If $P$ has only finitely many critical points, so that $\tau$ has only finitely many fixed points, this follows as an elementary point set topology exercise. However, at least theoretically, if $P$ has infinitely many critical points, limit cycle behavior is possible.

However, $\tau$ has additional properties beyond those just used, and it is possible that a theorem guaranteeing convergence to a point is provable under suitable hypotheses. For related material see References [3] and [4].
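The iteration analyzed here is simply repeated application of $\tau$. A minimal driver (my own, reusing the `reestimate` sketch above, with an arbitrary stopping tolerance) that also checks the monotone increase of $P$ along the iterates:

```python
def fit(a0, A, B, y, max_iter=100, tol=1e-10):
    """Iterate tau until the likelihood P stops increasing appreciably."""
    P_old = 0.0
    for _ in range(max_iter):
        A, B, P = reestimate(a0, A, B, y)   # P is evaluated at the pre-update parameters
        assert P >= P_old - 1e-12           # P(tau(lambda)) >= P(lambda)
        if P - P_old < tol:
            break
        P_old = P
    return A, B, P
```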

REFERENCES

1. L. E. Baum and J. A. Eagon, An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc. 73 (1967), 360-363.
2. L. E. Baum, T. Petrie, G. Soules, and N. Weiss, A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Statist. 41 (1970), 164-171.
3. G. R. Blakley, Homogeneous non-negative symmetric quadratic transformations. Bull. Amer. Math. Soc. 70 (1964), 712-715.
4. L. E. Baum and G. R. Sell, Growth transformations for functions on manifolds. Pacific J. Math. 27 (1968), 211-227.
