A NOTE ON SAMPLING GRAPHICAL MARKOV MODELS

MEGAN BERNSTEIN AND PRASAD TETALI

arXiv:1705.09717v2 [math.PR] 14 Aug 2017

Abstract. We consider sampling and enumeration problems for Markov equivalence classes. We
create and analyze a Markov chain for uniform random sampling on the DAGs inside a Markov
equivalence class. Though the worst case is exponentially slow mixing, we find a condition on the
Markov equivalence class for polynomial time mixing. We also investigate the ratio of Markov
equivalence classes to DAGs and a Markov chain of He, Jia, and Yu for random sampling of sparse
Markov equivalence classes.
Keywords: graphical Markov model; MCMC algorithm; reversible Markov chain

1. Introduction

A Bayesian network or DAG model is a type of statistical model used to capture a causal
relationship in data. The model consists of a directed acyclic graph (DAG) and a set of (dependent)
random variables, one variable assigned to each vertex. The DAG encodes conditional independence
relations among the random variables. These models are used in areas ranging from computational
biology to artificial intelligence [3, 6, 10, 14]. However, the correct DAG for a system can only
be inferred from data up to a condition called Markov equivalence, where all DAGs in a Markov
equivalence class represent the same statistical model [12]. Model selection algorithms must
balance the more complicated structure of Markov equivalence classes against the inefficiencies
and constraints of working directly with DAGs. Works (resulting in partial success)
towards understanding Markov equivalence classes through counting and random sampling have
considered the questions of enumeration of (1) the Markov equivalence classes on a given number
of vertices [8, 11, 19, 20], (2) the DAGs comprising a fixed Markov equivalence class [7, 9], or (3) all
Markov equivalence classes corresponding to a fixed underlying undirected graph [16, 17]. In spite
of much research, the topic of exact enumeration remains stubbornly open. In the present work,
we consider the problems of random sampling for questions (1) and (2).
The graphs in a Markov equivalence class are exactly those that share the same skeleton and
immoralities [8]: The skeleton of a directed graph is the underlying undirected graph obtained by
removing direction from all the edges. A v-structure (also termed an immorality) at c occurs among
vertices a, b, c, whenever the induced subgraph on these vertices has the two directed edges (a, c)
and (b, c) but not (a, b) or (b, a).
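This characterization is easy to check computationally. The following Python sketch (the encoding of a DAG as a set of directed edges, and all function names, are our own) tests Markov equivalence of two DAGs by comparing skeletons and immoralities.

```python
from itertools import combinations

def skeleton(dag):
    """Undirected version of a DAG given as a set of directed edges."""
    return {frozenset(e) for e in dag}

def immoralities(dag):
    """All v-structures a -> c <- b with a and b non-adjacent."""
    skel = skeleton(dag)
    parents = {}
    for a, c in dag:
        parents.setdefault(c, set()).add(a)
    return {(frozenset((a, b)), c)
            for c, ps in parents.items()
            for a, b in combinations(sorted(ps), 2)
            if frozenset((a, b)) not in skel}

def markov_equivalent(d1, d2):
    """Same skeleton and same immoralities (the characterization above)."""
    return skeleton(d1) == skeleton(d2) and immoralities(d1) == immoralities(d2)

# the path a -> b -> c, its reversal, and the immorality a -> b <- c
path = {("a", "b"), ("b", "c")}
rev  = {("c", "b"), ("b", "a")}
vee  = {("a", "b"), ("c", "b")}
```

Here the path a → b → c is equivalent to its reversal but not to the immorality a → b ← c, even though all three share a skeleton.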
An essential graph is a graphical representation of a Markov equivalence class that utilizes both
directed and undirected edges. Since all graphs in a Markov equivalence class share the same
skeleton, they only differ in direction of edges. An edge is directed in the essential graph for a
School of Mathematics, Georgia Institute of Technology. Email: [email protected]. Supported in part by
NSF Grant DMS-1344199.
School of Mathematics and School of Computer Science, Georgia Institute of Technology. Email:
[email protected]. Supported in part by NSF Grant DMS-1407657.

Figure 1. Forbidden subgraph

Figure 2. Protected edges

Markov equivalence class if that edge is directed in the same direction in all the graphs in the
equivalence class. Otherwise it is undirected. The partially directed graphs (PDAGs) that result
are distinct for distinct classes.
A result of Andersson et al. [1] gives a characterization of the PDAGs that are essential graphs
with four conditions [11]: (1) there are no partially directed cycles (i.e., it is a chain graph); (2) the
subgraph formed by taking only the undirected edges is chordal; (3) the graph in Figure 1 does not
occur as an induced subgraph; and (4) every directed edge is strongly protected: a directed edge
u → v is strongly protected if it occurs in one of the four induced subgraphs in Figure 2.
The essential graphs for Markov equivalence classes containing a single DAG are the essential
graphs with only directed edges. In one direction, this follows since if the class has one DAG then
all the edges are directed consistently within the class. In the other direction, if there are two
DAGs in the class they have the same skeleton and can only differ in the direction of some edge.
That edge would then be undirected in the essential graph.
A PDAG with no undirected edges has fewer conditions to be an essential graph: (1) it is a DAG,
and (2) every edge u → v is protected by being in one of the three induced subgraphs (a), (b), (c) of Figure 2.

Proposition 1. An edge u → v is protected in a PDAG with only directed edges if {w | w → u} ≠
{w | w ≠ u, w → v}.

Proof. If there exists a w in the first set but not the second, then u, v, w form the induced subgraph
in (b). If w is in the second set but not the first, then either u → w or not. In the first case this
forms the induced subgraph in (c), otherwise the induced subgraph of (a). 

This paper is comprised of three sections. Section 2 investigates a Markov chain for uniform
generation of the DAGs in a Markov equivalence class. It finds a class of graphs on which the
associated Markov chain is slow mixing, but a condition for fast mixing as well. The key barrier
to fast mixing is large cliques with substantial intersection (roughly half). Section 3 gives a
structure theorem for understanding Markov equivalence classes in terms of posets and uses the
structure theorem to explore the ratio of DAG’s to Markov equivalence classes. Section 4 relates
the observations to a Markov chain for uniform generation of Markov equivalence classes by He,
Jia, and Yu [11]. In particular, this part provides a simpler, shorter proof of the fact that the
chain due to He et al is ergodic. This section concludes with the construction of sparse PDAG’s
with small Hamming distance but large distance in the chain using moves with positive probability.
This highlights the fact that analysis of convergence to equilibrium of the chain is unlikely to be
successful using straightforward canonical path arguments or coupling techniques.
The related problem of counting the number of DAGs in a Markov equivalence class has been
studied combinatorially in [7] and algorithmically in recent work of He and Yu [9]. The latter’s
algorithm is best suited to graphs with many vertices adjacent to all other vertices. Our Markov
chain is fast mixing on many graphs that lack this feature, but the simplest example (see Proposition
11) of a graph on which our Markov chain is slow mixing is well suited to their algorithm. There
are, however, many graphs that are ill suited for both, such as chains of half-overlapping large cliques.

2. Edge Flip Random Walk on Chordal Graphs

This section investigates the mixing time of a Markov chain designed to pick random samples
from the DAGs forming the equivalence class corresponding to a specific essential graph. This
involves choosing acyclic orientations for each of the undirected edges in the essential graph in such
a way as to form no v-configurations. (Recall that a v-configuration is an induced subgraph on
three vertices depicted in Figure 2(a).) This depends on only the undirected edges of the essential
graph and can be done for each connected and undirected component separately.
The Markov chain is then on v-configuration-free acyclic orientations of connected chordal
undirected graphs G = (V, E). A step in the Markov chain is to reverse the direction of a single edge so
as to give another such orientation. Let HG be the graph with vertices the orientations for G and
edges the transitions of the Markov chain. Note that HG is a |E|-regular graph, and the Markov
chain has uniform stationary distribution. When drawing HG , we will suppress self-loops.
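A single step of this chain can be sketched in Python as follows (graph encoding and names are our own): a state is a set of directed arcs on the chordal skeleton, and a proposed flip is rejected unless the result is again an acyclic, v-configuration-free orientation.

```python
import random
from itertools import combinations

def is_acyclic(vertices, arcs):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indeg = {v: 0 for v in vertices}
    out = {v: [] for v in vertices}
    for v, w in arcs:
        indeg[w] += 1
        out[v].append(w)
    stack = [v for v in vertices if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in out[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    return seen == len(vertices)

def has_v_config(vertices, arcs):
    """A v-configuration: a -> c <- b with a and b non-adjacent."""
    adj = {frozenset(a) for a in arcs}
    parents = {v: [u for u, w in arcs if w == v] for v in vertices}
    return any(frozenset((a, b)) not in adj
               for v in vertices
               for a, b in combinations(parents[v], 2))

def flip_step(vertices, arcs, rng=random):
    """One step: reverse a uniformly random edge if the result is again an
    acyclic, v-configuration-free orientation; otherwise hold still
    (a self-loop of H_G)."""
    arcs = set(arcs)
    v, w = rng.choice(sorted(arcs))
    cand = (arcs - {(v, w)}) | {(w, v)}
    if is_acyclic(vertices, cand) and not has_v_config(vertices, cand):
        return cand
    return arcs
```

On K3 this is the adjacent transposition walk on S3: all six acyclic orientations are v-configuration-free, and from each state exactly two of the three flips are accepted.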

Proposition 2. Any v-configuration-free acyclic orientation of a connected graph G has a unique
source.

Proof. Let G be a connected graph. Let G̃ be a v-configuration-free acyclic orientation of G.
Suppose there are two sources v and w in G̃. G is connected, so there exists a minimal length
undirected path from v to w, (v, v1 , v2 , ..., vk , w). Since v and w are sources, the edge between v
and v1 is directed v → v1 and the edge w to vk is directed vk ← w. There must be at least one
point on this path where right directed edges meet left directed edges, vi → vi+1 ← vi+2 . Since G̃
is v-configuration-free, there must be an edge vi to vi+2 in G. However, this gives a shorter path v
to w. Therefore, there can only be a single source in G. 
As a side note, the chordality condition on the undirected component of a PDAG exists because
only chordal graphs have acyclic v-configuration-free orientations. This means the absence of a
chordality assumption in the above proposition is not meaningful, as the result is vacuous for non-chordal graphs.

Proposition 3. The unique source of a v-configuration-free acyclic orientation of a connected
graph G determines the orientation of all edges with one endpoint closer to the source than the
other: such edges are oriented away from the source.
Proof. The proof follows by induction on the distance from the edge to the source. If the edge is
incident to the source, i.e. distance zero, then the edge must be oriented away from the source.
Suppose this holds for all edges at distance at most d from the source. Let e = {v, w} be an edge
at distance d + 1 from the source with v closer to the source. Then there must be an edge
f = {v, z} incident to v at distance d whose orientation is forced to be towards v. Moreover, there
cannot be an edge from z to w, since v is closer to the source. If e were oriented towards v, then
z → v ← w forms a v-configuration. Since the orientation is v-configuration-free, e is oriented away
from the source. 

Example 4. If G = Kn , the v-configuration-free acyclic orientations are in bijection with the permutations
of n elements. This follows from Proposition 3 by successively choosing from the remaining vertices
a source which orients the edges adjacent to that source. Only the edges between two consecutive
choices of sources can be flipped, so this Markov chain is the adjacent transposition walk on the
symmetric group Sn . The adjacency graph HG is the Permutohedron.
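For small n the bijection, and the fact that the flippable edges correspond to adjacent transpositions, can be checked by brute force; the sketch below (encoding our own) does so for K4.

```python
from itertools import permutations

def orientation_from_perm(perm):
    """Orient each edge of K_n from the earlier to the later element,
    so perm[0] is the source chosen first, perm[1] second, and so on."""
    return {(perm[i], perm[j])
            for i in range(len(perm)) for j in range(i + 1, len(perm))}

n = 4
# distinct permutations give distinct orientations: the bijection of Example 4
orients = {frozenset(orientation_from_perm(p)) for p in permutations(range(n))}
```

Swapping two consecutive choices of sources changes exactly one edge of the orientation, which is the adjacent transposition move of the walk.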

Example 5. If G is the path, or indeed any tree, HG is isomorphic with G. Since there is a unique
path from any vertex to any other vertex, by Proposition 3, the states are entirely determined by
the unique source in the orientation. A move in the Markov chain moves the source to an adjacent
vertex.

These two cases are the extreme examples, and the structure for any G can be viewed by
decomposing into pieces of these types. The following structure theorem for chordal graphs is the
key (see Theorem 12.3.11 in Diestel [5]):

Proposition 6. A graph is chordal if and only if it has a tree-decomposition into complete parts.

Proposition 7. A tree-decomposition of a chordal graph is a tree T with vertices the maximal
cliques of G such that for any vertices t1 , t2 , t3 ∈ T , if t2 is along the unique path from t1 to t3 ,
then t1 ∩ t3 ⊆ t2 .

Definition 8. A maximal clique t is a non-follower in an orientation of G when, for every edge (v, w)
with w ∈ t, we have v ∈ t as well.

The structure of the graph HG is closely related to the structure of the tree decomposition of
G. The following characterization of HG will be the key to analyzing the v-configuration free edge
flip random walk on G, or equivalently the random walk on HG . As the v-configuration free edge
flip walk on a clique is the random walk on a permutohedron, the key to understanding HG is to
describe how the permutohedra for the maximal cliques in the tree decomposition of G occur in
HG .
To get HG , first for each maximal clique ti , its graph Hti , a permutohedron, is dilated by
taking the Cartesian product with other permutohedra. Then these pieces are glued together by
identifying faces of the polytopes to form HG . Given a tree-decomposition T of G and a maximal
clique ti , for each other maximal clique tj let sj be the clique immediately before tj on the unique
path from ti to tj . The dilation of ti will be Di = Π_{j≠i} H_{tj ∩ sj^c} , where the product is the Cartesian
product. Then glue, for ti and tj adjacent in T , Hti × Di and Htj × Dj along Hti∩tj × Di,j , where
Di,j = H_{ti ∩ tj^c} × Di = H_{tj ∩ ti^c} × Dj .
Figure 3. Three graphs with corresponding tree decompositions and adjacency graphs

Proposition 9. HG is formed by first making Hti × Di for each maximal clique ti in G. For each
pair of maximal cliques with non-empty intersection, their respective pieces are identified along the
faces Hti ∩tj × Di,j .

Proof. As shown above in Proposition 2, each acyclic, v-configuration-free orientation of G has a
unique source. This source, as in Proposition 3, determines the orientation of all edges with one end
point closer to the source than the other. Fixing only a source and orienting these edges breaks down
the graph into disjoint components with independence on how to orient the disconnected pieces.
This will give rise to the decomposition. The maximal cliques containing the source remain in the
same component. To orient the remaining edges, sources are recursively chosen until all maximal
cliques are disconnected. The non-followers are the maximal cliques containing all recursive choices
of sources up to when the maximal cliques are disconnected. An orientation will be part of Hti × Di
when ti is a non-follower in the orientation. The gluing comes from when multiple cliques are non-
followers in the orientation. Hti ∩tj × Di,j are all the orientations from choosing the first |ti ∩ tj |
recursive sources in ti ∩ tj . 

Corollary 10. Let C(G) be the number of maximal cliques in G. For an orientation v ∈ HG ,
let M (v) be the number of non-following cliques in v. The degree in HG of an orientation is
|G| − C(G) + M (v) − 1. The minimal degree is |G| − C(G).

Proof. The degree of the vertices in a component for a clique Hti is |ti | − 1, the number of adjacent
transpositions on S|ti | . Taking cartesian products adds the degree of the graphs involved, so the
degree of a vertex in Hti × Di is |G| − C(G). It is left to count how many edges extend outside of
Hti × Di . These components are glued together along the pairs ti , tj with ti adjacent to tj in T
and ti ∩ tj ≠ ∅. The number of overlapping edges between the pieces glued together is the degree
of a vertex in Hti ∩tj × Di,j , namely |G| − C(G) − 1. Each gluing thus increases the degree of the
orientations involved by one. The number of gluings involving an orientation is one less than the
number of non-following cliques.


Proposition 11. There exist graphs G for which the edge flip random walk is mixing exponentially
slowly. For instance, when G is made up of two cliques of size 2n/3 sharing n/3 vertices, the mixing
time satisfies tmix ≥ 4^{n/3−1} .

Proof. G has two maximal cliques t1 and t2 , each of size 2n/3, with intersection s of size n/3.
For each maximal clique, its component is Ci = H_{K_{2n/3}} × H_{K_{n/3}} , identified along the face
F = H_{K_{n/3}} × H_{K_{n/3}} × H_{K_{n/3}} , following our construction of HG above. The face F
corresponds to orientations where both maximal cliques are non-followers. Let R be one of these
components with the face removed, R = C1 \ F , i.e. the orientations in which one clique is the
leader. Let Q(A, B) be the chance of moving from A to B in one step of the random walk starting
from the uniform stationary distribution π. The bottleneck ratio of the random walk on HG is:

Φ∗ ≤ Φ(R) = Q(R, R^C ) / π(R) .
The Markov chain is the edge flip random walk on G, so each state has |E| possible moves. The
ratio Q(R, R^C )/π(R) is the total number of edges from R to R^C in HG over |E| · |R|.
Each orientation of G in F comes from a recursive choice of sources where the first n/3 sources
are from s, the intersection of the maximal cliques, followed by an independent recursive choice of
sources in t1 \ s and t2 \ s. From such an orientation, there is a single edge into R corresponding
to flipping the edge between the source chosen last in s and first in t1 \ s. Therefore, the number
of edges from R to R^C is |F | = (n/3)!^3 . The probability of R under the stationary distribution
is |R| = (2n/3)!(n/3)! − (n/3)!^3 over the number of orientations of G. By inclusion-exclusion, the
number of orientations is |C1 | + |C2 | − |F | = 2(2n/3)!(n/3)! − (n/3)!^3 .

Q(R, R^C ) / π(R) = (1/|E|) · (n/3)!^3 / [ (2n/3)!(n/3)! − (n/3)!^3 ] = (1/|E|) · 1 / [ (2n/3 choose n/3) − 1 ].

Using Stirling's approximation, (2n/3 choose n/3) ≈ 4^{n/3} / √(πn/3). For n ≥ 4 we have
4^{n/3} ≤ |E| [ (2n/3 choose n/3) − 1 ], and so Φ(R) ≤ 4^{−n/3} .
By Theorem 7.3 of [13], we have tmix ≥ 1/(4Φ∗ ). This means for this G, tmix ≥ 4^{n/3−1} . □
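The counts used in this proof can be checked by brute force at the smallest instance, n = 6, where G is two copies of K4 sharing two vertices; the sketch below (vertex labels and names are our own) enumerates all orientations of the 11 edges and recovers 2(2n/3)!(n/3)! − (n/3)!^3 = 88 valid states.

```python
from itertools import combinations, product

# two copies of K_4 sharing two vertices: the n = 6 case of the construction
A, B = {0, 1, 2, 3}, {2, 3, 4, 5}
V = sorted(A | B)
edges = sorted({frozenset(e) for S in (A, B) for e in combinations(sorted(S), 2)},
               key=sorted)
adj = set(edges)

def is_valid(arcs):
    # acyclic, via Kahn's algorithm
    indeg = {v: 0 for v in V}
    out = {v: [] for v in V}
    for v, w in arcs:
        indeg[w] += 1
        out[v].append(w)
    stack = [v for v in V if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for w in out[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                stack.append(w)
    if seen != len(V):
        return False
    # and v-configuration-free
    parents = {v: [u for u, w in arcs if w == v] for v in V}
    return all(frozenset((a, b)) in adj
               for v in V for a, b in combinations(parents[v], 2))

count = sum(is_valid([tuple(sorted(e)) if b else tuple(sorted(e))[::-1]
                      for e, b in zip(edges, bits)])
            for bits in product((0, 1), repeat=len(edges)))
```

With 11 edges there are only 2^11 = 2048 orientations to test, so the enumeration is immediate.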

We will use a decomposition theorem of Madras and Randall (see Theorem 1.1 in [15]) to get an
upper bound on the mixing time. Given a reversible Markov chain P on Ω = ∪_{i=1}^m Ai with stationary
distribution π, they construct two types of Markov chains: for each i, a (restriction) Markov chain
restricted to the subset Ai , and a (projection) Markov chain on m states, a1 , a2 , . . . , am , with
ai representing the set Ai . The mixing times of these auxiliary chains are used to get a bound
on the mixing time for the original chain. Let the Markov chain inside each set be P[Ai ] (x, B) =
P (x, B) + 1_{x∈B} P (x, Ai^c ) for x ∈ Ai and B ⊂ Ai . To define the chain over the subsets, let Θ :=
max_{x∈Ω} |{i : x ∈ Ai }|. Define the chain over the covering to be:

PH (ai , aj ) := π(Ai ∩ Aj ) / ( Θ π(Ai ) ).

Theorem 12 (Madras-Randall). In the preceding framework, we have

Gap(P ) ≥ (1/Θ^2 ) Gap(PH ) min_{i=1,...,m} Gap(P[Ai ] ).

An upper bound on the mixing time of this random walk can be found through an upper bound
on the mixing of the Markov chain for each maximal clique and for a constructed biased random
walk on the tree from the tree decomposition. The Markov chains for each maximal clique are the
adjacent transposition walk, which is well understood. The random walk on the tree will be studied
with comparison techniques.
The only impediment to rapid mixing is a quantity relating to the overlap between cliques along
a path in the tree decomposition. Let

oG := ( Σ_i |ti |! |Di | ) / ( min_{(j,k)∈T} |tj ∩ tk |! |Dj,k | ).

Theorem 13. For a graph G on n vertices, let oG be defined as above. Let T be a tree decomposition
with Θ = deg(T ) being the maximal number of overlapping maximal cliques in G. Let tmax =
maxi |ti |. Then the spectral gap of the random walk on HG satisfies:
Gap(HG ) ≥ ( oG Θ^3 (|G| − |T |) diam(T ) )^{−1} · 2 (1 − cos(π/tmax )) .

Proof. Using the decomposition theorem of Madras and Randall [15], an upper bound on the mixing
time of the random walk on HG can be obtained by understanding the mixing of a random walk
PAi on each component Hti × Di and the mixing of a Markov chain on the tree from the tree
decomposition of G into maximal cliques.
The random walk on the pieces Hti × Di are Cartesian products of the adjacent transposition
walk on the symmetric group. Random walk on the Cartesian product is the product chain on the
components. The spectral gap for the product Γ̃ = Γ1 × ... × Γd , with the chance of moving in the
jth chain being wj and the spectral gap of the jth chain being γj , is:

γ̃ = min_j wj γj .

Here the spectral gap of the random walk on HKi is (2/(i − 1)) (1 − cos(π/i)) by a result of Bacher [2].
Note that the degree of the vertices in HKi (or Cartesian products thereof) is the same as the
degree of Ki (or Cartesian products thereof), namely i − 1. Therefore, the chance of making a
move in a component of the product chain is the size of that clique minus one over the number of
vertices minus the number of cliques. This gives:

Gap(PAi ) = γ̃ ≥ (2/(|G| − |T |)) (1 − cos(π/tmax )) .
The Markov chain on the tree decomposition into cliques has transition probabilities constructed
as follows. Let Θ denote the maximum degree of the tree. For two cliques ti , tj adjacent in the
tree,

PT (ti , tj ) = π(Hti∩tj × Di,j ) / ( Θ π(Hti × Di ) ) = [ Θ (|ti | choose |ti ∩ tj |) ]^{−1} .

It has stationary distribution π(ti ) = |ti |! |Di | z^{−1} , where z = Σ_i |ti |! |Di |.

The spectral gap of this Markov chain on the maximal cliques will be bounded using comparison
to a Markov chain on the complete graph with vertices the maximal cliques with the same stationary
distribution. For all i, j let P̃T (ti , tj ) = π(tj ). This Markov chain mixes in one step.
The comparison technique of Diaconis and Saloff-Coste [4] is in terms of the quantity A below,
where γk,l is the unique path in T from tk to tl :

A = max_{(ti ,tj )∈T} ( 1 / (PT (ti , tj ) π(ti )) ) Σ_{k,l : (ti ,tj )∈γk,l} |γk,l | π(tk ) π(tl ).

To simplify this, note |γk,l | ≤ diam(T ) and Σ_k π(tk ) = 1, hence Σ_{k,l} |γk,l | π(tk ) π(tl ) ≤ diam(T ).
Additionally, PT (ti , tj ) π(ti ) = |ti ∩ tj |! |Di,j | / ( Θ Σ_i |ti |! |Di | ) ≥ 1/(oG Θ). Therefore,

A ≤ oG Θ diam(T ).

This gives, for the largest non-trivial eigenvalue β1 of PT , β1 ≤ 1 − 1/A, and

Gap(PT ) ≥ 1/A ≥ ( oG Θ diam(T ) )^{−1} .

The result of [15] gives Gap(P ) ≥ (1/Θ^2 ) Gap(PH ) min_i Gap(PAi ), which from the bounds above gives:

Gap(P ) ≥ ( oG Θ^3 (|G| − |T |) diam(T ) )^{−1} · 2 (1 − cos(π/tmax )) . □




Corollary 14. When all the maximal cliques in G are the same size t and all intersecting cliques
intersect along s vertices, then oG = |T | (t choose s) and the spectral gap satisfies:

Gap(P ) ≥ ( (t choose s) |T | (|G| − |T |) Θ^3 diam(T ) )^{−1} · 2 (1 − cos(π/t)) .
The extreme cases of G being a complete graph or a tree fall into the purview of this corollary,
and the bound on the spectral gap is within a constant of what it should be.

3. Posets and Essential Graphs

A poset, or partially ordered set, is a combinatorial object defined on a set X with a set of
relations P . The relations form a partial order in that they are reflexive (for all x ∈ X, x ≤ x),
antisymmetric (not both x ≤ y and y ≤ x if y ≠ x), and transitive (if x ≤ y, y ≤ z, then x ≤ z). A
few terms from posets are useful in looking at essential graphs. We say y covers x if x < y with
no z so that x < z < y. A chain in a poset is a set C ⊆ X with all elements of C totally ordered
by the poset. The height of a point in a poset is the size of the largest chain with that point as the
highest point in that total order.
Every DAG can be reduced to a poset by establishing the relations v < w if w is reachable from
v. For each labeled poset P , it is straightforward to count the number of essential graphs with
only directed edges that reduce to it as well as the number of DAGs. We recall here that a relation
x < y in a poset P is called a cover relation, if there is no z such that x < z < y in P .

Proposition 15. The number of DAGs is

Σ_P Π_{v∈P} 2^{d(v)−c(v)} ,

where the sum is over labeled posets, d(v) is the number of poset elements below v in the order, and
c(v) is the number of elements covered by v.

Proof. One can construct the DAGs that have a specific reachability poset by looking at the down
set and cover relations of each point of the poset. The points covered by a point v are obliged to
have directed edges to v. All other elements of the down set have the option to have a directed
edge to v. □
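Proposition 15 can be verified exhaustively for small n. The sketch below (encoding our own) groups the DAGs on three labelled vertices by reachability poset and checks the product formula poset by poset.

```python
from itertools import product

V = (0, 1, 2)
pairs = [(a, b) for a in V for b in V if a != b]

def closure(arcs):
    """Strict reachability relation of a digraph (transitive closure)."""
    rel = set(arcs)
    while True:
        new = {(a, c) for (a, b) in rel for (b2, c) in rel if b == b2} - rel
        if not new:
            return frozenset(rel)
        rel |= new

def is_acyclic(arcs):
    return all((a, a) not in closure(arcs) for a in V)

# group the DAGs on three labelled vertices by their reachability poset
by_poset = {}
for bits in product((0, 1), repeat=len(pairs)):
    arcs = {p for p, b in zip(pairs, bits) if b}
    if is_acyclic(arcs):
        by_poset.setdefault(closure(arcs), []).append(arcs)

def predicted(poset):
    """Proposition 15's product of 2^(d(v) - c(v)) over the poset's points."""
    total = 1
    for v in V:
        down = {a for (a, b) in poset if b == v}
        covered = {a for a in down
                   if not any((a, z) in poset and (z, v) in poset
                              for z in down)}
        total *= 2 ** (len(down) - len(covered))
    return total
```

The totals recover the known values: 25 DAGs on three labelled vertices spread over the 19 labelled posets on three elements.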

Surprisingly, satisfying the three conditions to protect the edges is straightforward in this setting.

Proposition 16. The number of Markov equivalence classes of size one is:

Σ_P Π_{v∈P} ( 2^{d(v)−c(v)} − 1_{c(v)=1} ),

where the sum is over labeled posets, d(v) is the size of the down set of v, and c(v) is the number
of elements covered by v.

Proof. From a labeled poset P , we will count all the essential graphs with only directed edges that
reduce to it. The DAG must include all the cover relations in P . For each point v in the poset, a
choice can be made to include or not a directed arrow to v from any u in the down set of v that
is not covered by v. If d(v) is the size of the down set of v, and c(v) is the number of elements
covered by v, this gives 2^{d(v)−c(v)} ways to have edges come into v. These edges are protected as
long as the condition {w|w → u} ≠ {w|w ≠ u, w → v} is met. If v covers at least two other points,
then no other vertex coming from the down set of v can have an incoming edge from both these
points, since they must be incomparable in the poset. If v only covers a single point, then there is
a unique way for {w|w → u} ≠ {w|w ≠ u, w → v} to fail: if u is the point that v covers and they
have edges from exactly the same vertices. This gives 2^{d(v)−c(v)} − 1_{c(v)=1} ways to have protected
directed edges coming into v. 
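For three vertices this count can be cross-checked by brute force: grouping all DAGs by skeleton and immoralities yields 25 DAGs in 11 Markov equivalence classes, 4 of which have size one, matching the value the formula gives for n = 3. A sketch (encoding our own):

```python
from itertools import combinations, product

V = (0, 1, 2)
pairs = [(a, b) for a in V for b in V if a != b]

def is_acyclic(arcs):
    # transitive closure; a cycle shows up as a reflexive pair
    rel = set(arcs)
    for _ in V:
        rel |= {(a, c) for (a, b) in rel for (b2, c) in rel if b == b2}
    return all((a, a) not in rel for a in V)

def invariant(arcs):
    """Class invariant: the pair (skeleton, immoralities)."""
    skel = frozenset(frozenset(p) for p in arcs)
    imm = frozenset((frozenset((a, b)), c)
                    for c in V
                    for a, b in combinations([u for (u, w) in arcs if w == c], 2)
                    if frozenset((a, b)) not in skel)
    return (skel, imm)

classes = {}
for bits in product((0, 1), repeat=len(pairs)):
    arcs = {p for p, b in zip(pairs, bits) if b}
    if is_acyclic(arcs):
        classes.setdefault(invariant(arcs), []).append(arcs)

singletons = sum(1 for c in classes.values() if len(c) == 1)
```

The four singletons are the empty graph and the three labelled immoralities, in agreement with the poset count.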

3.1. Exact enumeration and discussion. The ratio of essential graphs to DAGs and essential
graphs with only directed edges to DAGs is of interest to determine the limit of increased efficiency
in working with essential graphs versus DAGs. Using essentially the same observation above,
Steinsky [19] uses inclusion-exclusion to get a recursive formula for an , the number of essential
graphs with only directed edges, otherwise known as essential DAGs. The inclusion-exclusion
works by adding a set of i maximal vertices and connecting the n − i lower vertices to them, arriving
at the formula

an = Σ_{i=1}^n (−1)^{i+1} (n choose i) ( 2^{n−i} − (n − i) )^i a_{n−i} .
This is done in the style of Robinson's recursive formula [18] for the number of DAGs,

a′n = Σ_{i=1}^n (−1)^{i+1} (n choose i) 2^{i(n−i)} a′_{n−i} .

Steinsky has a second paper published in 2013 on enumerating Markov equivalence classes [20].
While they are both exceedingly useful for computation, the alternating quality of both formulas
makes asymptotic analysis a challenge. While the poset construction of a formula for an would be
next to useless for computing an for large n, its all-positive structure makes it more amenable to
informing the ratio a′n /an .
Steinsky computed the ratio for n ≤ 300. By n = 200, the first 45 decimal places appear to
have stabilized, so a′n /an = 13.6517978587767...
The q-Pochhammer symbol appears to be playing a role in this constant, as shown below by
looking at certain families of posets. The product of the q-Pochhammer symbol that appears and
Steinsky's ratio, (a′n /an ) · (1/2, 1/2)_{n−2} , is just under 4, at 3.94...


The largest contribution of an unlabeled poset to the formulas above is for the total orders,
where there are n! ways to label the poset. This family also has the largest number of DAGs per
reachability poset, at Π_{i=2}^n 2^{i−2} = 2^{(n−1 choose 2)} , since this maximizes the down sets and minimizes the
covered points. However, none of these DAGs forms an essential DAG, since none of their edges is
protected. One of the next largest contributions to the DAG formula is from the unlabeled posets
that form a total order aside from two incomparable elements at the bottom of the linear order.
There are n!/2 ways to label this poset and Π_{i=4}^n 2^{i−2} ways to construct a DAG from it. Now, a
large number of these are essential graphs. Specifically, Π_{i=4}^n (2^{i−2} − 1) of them. This proportion
is:

Π_{i=4}^n (2^{i−2} − 1)/2^{i−2} = Π_{i=4}^n ( 1 − (1/2)^{i−2} ) = 2 (1/2, 1/2)_{n−2} ,

where (a, q)_n = Π_{i=0}^{n−1} (1 − a q^i ) is the q-Pochhammer symbol.
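The last equality is exact, as a few lines of Python confirm numerically (the helper name is our own):

```python
def qpoch(a, q, m):
    """q-Pochhammer symbol (a; q)_m = prod_{i=0}^{m-1} (1 - a q^i)."""
    out = 1.0
    for i in range(m):
        out *= 1.0 - a * q ** i
    return out

n = 20
# the proportion of essential DAGs among DAGs of the almost-linear posets
proportion = 1.0
for i in range(4, n + 1):
    proportion *= (2 ** (i - 2) - 1) / 2 ** (i - 2)
```

The symbol converges quickly: (1/2; 1/2)_m is already close to its limit, roughly 0.2888, for moderate m.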
This ratio is quite a bit off from the ratio of essential graphs with only directed edges to DAGs,
but by combining the two classes together, one gets a lot closer.
There are four times as many DAGs from linear orders as DAGs from our almost linear
orders. This gives, instead of 2 (1/2, 1/2)_{n−2} as a ratio, (2/5) (1/2, 1/2)_{n−2} . Steinsky's
computations would match up with a leading fraction of ≈ 1/3.94.
There should be a natural way to link up posets that lead to essential graphs and those that do
not that results in ratios that closely approximate the real ratio. For instance, all essential graphs
on only directed edges have in their reachability poset at least two points of height 1 covered by
each point of height 2 (and all points of height 2 cover at least two points of height 1). By taking any
pair of these height 1 points and covering one with the other, one arrives at a poset not giving an
essential graph. Moreover, for a poset with height 2 points not covering at least two height 1 points,
there is a clear map to essential graph posets: make those two points incomparable. Moreover,
since the q-Pochhammer symbol that appears will change as the posets get wider, the constant may
be easier to achieve than it first appears.
To extend this work to understanding the ratio of DAGs to all essential graphs, one needs to
understand how undirected edges can be added as well as the directed edges. The undirected edges
are not as independently placeable as the directed edges due to the restriction that they must form
a chordal graph. Work has been done on counting the size of the equivalence class given a fixed
set of undirected edges in [7, 9] and sampling inside the equivalence class in Section 2. This is an
attractive question in that it depends only on the undirected edges present and not on any directed
edges. It may also be beneficial to study how the undirected edges can be placed on top of the
directed edges in the poset model.
Undirected edges can be added to an essential graph with only directed edges as follows. Edges
are only allowed between vertices u, v with directed edges coming in from exactly the same other
vertices. This prevents the disallowed induced subgraphs of Figure 1. Care must also be taken
to keep the edges coming out of these vertices protected. Using protected edge condition Figure
2 (d), the edges coming into v are protected after the addition of undirected edges if at least two
of the vertices with edges to v have no undirected edge between them. On top of these protection
conditions, the undirected edges must also be chordal. It seems like one of the alternate definitions
of chordal might be more tractable. For instance, this definition seems more naturally suited to
recursive constructions: a graph is chordal if its vertices are partitionable into three sets A, C, B
with the induced subgraph on C complete and no edges with one end in A and the other in B.

4. A Markov Chain on Essential Graphs

Essential graphs are difficult to count or generate exactly beyond about 20 vertices. He, Jia, and
Yu [11] created a reversible Markov chain on essential graphs that can be used to sample uniformly
from the essential graphs on a large number of vertices. It is designed to function under a sparsity
condition of at most constant density. In order to produce a non-lazy chain, the authors choose to
only allow moves that give a different essential graph. This has the effect of giving a non-uniform
stationary distribution and making the chain harder to describe. Here, a lazy version of their chain
is used in order to make formulas more explicit.
He, Jia, and Yu utilize six moves on the state space of essential graphs on n vertices. The first
four are changes in one edge. A random pair of vertices is selected uniformly at random. Then a
choice is made to attempt to add a directed edge, remove a directed edge, add an undirected edge,
or remove an undirected edge. Adding an edge is only allowed if an edge is not already present.
Removing an edge is similarly only allowed if that type of edge is already present. Furthermore,
the resulting graph need not be an essential graph itself, but must be extendable by directing edges
to give a DAG. The graph is modified to be the essential graph of that DAG. A move to add an
edge is furthermore only permissible if that edge remains the type added in the essential graph.
These additional factors of correcting the graph to be essential are done in order to make the chain
reversible. The final two moves are to pick three vertices a, b, c and add or remove an immoral. An
immoral can only be added if there are undirected edges a − b and b − c with a and c not adjacent.
As before, the graph is then corrected to be an essential graph if possible or the move is rejected.
To remove an immoral at b, there must be directed edges a → b and c → b with no edge from a to
c. This is changed by the move to undirect the edges, again with the restriction that it repairs to
an essential graph with those edges still undirected.
He, Jia, and Yu show the chain is connected via an iterative procedure that removes all undirected
edges and then repeatedly removes a directed edge or a v-configuration, thus prescribing a
path in the chain from any essential graph to the empty graph. The poset relationship described
above can be used to give an explicit order in which edges can be removed to reach the
empty graph. Such a procedure could be of use in a mixing time analysis of the chain.
• Remove all undirected edges.
• Starting with the maximal elements at height ≥ 2 of the poset for the essential graph,
remove all directed edges into the maximal elements one at a time in an order that keeps
them from having the same incoming edges as their children. If a maximal element has
multiple children this can be done by removing the edges from the children last (removing
any children at height 1 first). If there is a single child, remove all incoming edges in
common with that child first. Recursively continue with the new maximal elements.
• Continue this until only immorals consisting of elements at height 1 and 2 are left. Remove
these by first removing all the directed edges that can be removed singly, then, for each
remaining immoral, turning it into undirected edges and removing those.

Proposition 17. Every move in the above procedure is a move in the Markov chain proposed by
He et al [11].

Proof. Of the four conditions for an essential graph, removing undirected edges could only violate
the requirement that the undirected edges form a chordal graph. Several of the alternate definitions of
chordal graphs yield an order in which to remove all the undirected edges while leaving the graph chordal at each
step. For instance, one definition of chordal is that every chordal graph can be broken up into three
sets A, C, B with C non-empty and complete, no edges between A and B, and A and B chordal.
If A and B are independent, removing the edges from A to C and from B to C, followed by the edges adjacent
to each vertex in C, leaves the graph chordal at each step. Recursively removing the edges inside A
and B, then the edges from A and B to C, then the edges inside C, thus gives a way to remove all edges leaving the graph
chordal at each step.
For an essential graph with only directed edges, by Proposition 1 an edge u → v is protected if
{c | c → u} ≠ {c | c ≠ u, c → v}. For a maximal element v of the poset, there are no edges coming
out of v, so removing edges coming into v does not endanger the protectedness of any edges going
into other vertices. It is enough to ensure that the edges coming into v are protected at each step
of their removal. Let w1, . . . , wr be the vertices covered by v in the poset. Each of these is in the set
{c | c ≠ u, c → v} when it is not u, but is never in {c | c → u}. As long as there are at least two
vertices covered by v, all edges into v are protected, and all but those two can be removed in any
order. Leave the edge from the highest vertex wi that v covers. Since v has height at least 3, wi
has height at least 2, and {c | c → wi} ≠ ∅. Therefore the edge wi → v is protected if it is the last
edge to v left. Remove the other edge, then this edge.
Suppose instead that v covers a single vertex w, with the height of v at least 3. Since the edge w → v
is protected, {c | c → w} ≠ {c | c ≠ w, c → v}. The presence of a vertex in the latter set not in the
former would mean v covered at least two vertices. So the first set contains at least one vertex u
not in the other set. Remove all edges to v other than w → v in any order: the edge w → v will remain
protected by the presence of u, and the presence of w → v protects the others while they exist. Finally,
remove w → v.
After recursively removing all maximal vertices of height at least 3, the essential graph left has
only vertices of height 1 with directed edges leading to vertices of height 2. Each vertex of height
two has at least two incoming edges since forming an immoral is the only way such edges can be
protected. Prune each immoral down to two edges by removing single edges. Take one immoral
a → b ← c with no other edges going to b. Then turning it into a − b − c forms an essential graph
because no u → v − w edges were formed, and all other directed edges remain protected in an
immoral. Then these undirected edges can be removed since they are the only undirected edges in
the graph. Repeat with the other immorals. 
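The undirected-edge step of the proof can be made concrete: every chordal graph with at least one edge has a non-isolated simplicial vertex, and deleting a single edge at a simplicial vertex cannot create a chordless cycle, so repeatedly stripping the edges at a simplicial vertex orders the edges so that every single deletion leaves a chordal graph. A Python sketch, with the graph representation an assumption made for illustration:

```python
def simplicial_vertex(adj, require_edge=False):
    """A vertex whose neighbourhood is a clique, or None if none exists."""
    for v in adj:
        if require_edge and not adj[v]:
            continue
        if all(w in adj[u] for u in adj[v] for w in adj[v] if w != u):
            return v
    return None

def is_chordal(adj):
    """Chordal iff greedy simplicial elimination empties the graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    while adj:
        v = simplicial_vertex(adj)
        if v is None:
            return False
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return True

def chordal_removal_order(adj):
    """Order the edges of a chordal graph so that each single deletion
    leaves a chordal graph (assumes the input is chordal)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while any(adj.values()):
        v = simplicial_vertex(adj, require_edge=True)
        for u in list(adj[v]):  # strip every edge at v, one at a time
            order.append((v, u))
            adj[v].discard(u)
            adj[u].discard(v)
    return order
```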

Define the Hamming distance between two essential graphs to be the number of edges, including
direction, in which the graphs differ. We can show that, using the basic four moves, at most two
moves are needed to move between any two essential graphs at Hamming distance one. The chain
is not connected by these four moves alone. Moving between two essential graphs that differ by
adding or removing an immoral (without changing the direction of other edges), a → b ← c
versus a − b − c, is, however, handled by the immoral moves, and the above procedure shows that the
six moves together suffice for connectivity. Note that this should also mean the chain would be
connected and reversible if the “repairing” moves, those that do not immediately give essential graphs, were
left out.
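The distance just defined is simple to state in code. In the sketch below, an essential graph is represented, purely as an illustrative assumption, by a dictionary mapping each ordered vertex pair (u, v) with u < v to one of '->', '<-', or '--':

```python
def hamming_distance(g1, g2):
    """Number of vertex pairs at which two graphs differ, counting a
    directed edge, the reversed edge, an undirected edge, and no edge
    as four distinct possibilities."""
    return sum(g1.get(pair) != g2.get(pair)
               for pair in set(g1) | set(g2))
```

For instance, the immoral a → b ← c and the path a − b − c differ at two vertex pairs and so are at Hamming distance two.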

Proposition 18. There is a path in the chain of length at most 2 between any two essential graphs
at Hamming distance one.

Proof. The proof breaks into three cases. Let a, b be the two vertices at which the graphs
differ. Case I is that one graph has no edge a to b while the other has a directed edge, WLOG,
a → b. Case II is that one graph has no edge and the other has an undirected edge a − b. Case III
is that one graph has a directed edge a → b and the other b → a. It is not possible to differ at a
directed versus undirected edge, since by definition an edge in an essential graph is undirected if
there is a DAG in which that arrow goes in either direction.
Case I: Since both no edge and the directed edge form essential graphs, the Markov chain moves
of adding a directed edge and removing a directed edge are legal, and this gives a one step path in
either direction under the chain.
Case II: Similarly to Case I, since both no edge and the undirected edge form essential graphs,
the Markov chain moves of adding an undirected edge and removing an undirected edge give a one
step path between the two essential graphs.
Case III: We will show that removing the directed edge and adding it back in the other direction gives
a two step path in the chain. This case carries the complexity, since it is necessary to show that the
intermediate move gives an essential graph differing only at that edge. The graph G obtained by
removing the directed edge must still be chordal and contain no partially directed cycles, since no
undirected edges were changed and removing edges cannot add a cycle. The third criterion for an
essential graph, no x → y − z, could also not be introduced by removing an edge. It is left to show
Figure 4. Hamming distance two counterexample.

all the directed edges remain protected. Suppose a → b is used as one of the edges in one of the
four induced subgraphs that protects an edge u → v. The fourth induced subgraph cannot be
the relevant one, since switching the direction of any of the directed edges gives a partially directed
cycle. That means u → v is protected by one of the first three conditions, which, by Proposition 1, are
equivalent to {c | c → u} ≠ {c | c ≠ u, c → v}. The edges between a and b are only relevant if one of u or v is
a or b. Without loss of generality we check the cases u = a and v = a. Suppose u = a and v ≠ b. In
the graph with a → b, the edge is not in either set and the sets are still not equal. That means that
with no edge between a and b the sets are still distinct and the edge is protected. Suppose v = a and u ≠ b. In
the graph with b → a, the edge is not counted in either set and the sets are still not equal. That
means without the edge between a and b, the edge u → v is still protected.


Unfortunately, the Markov chain does not give short paths between all essential graphs at
Hamming distance two. For instance, consider the pair of graphs in Figure 4, where In stands for an independent
graph on n vertices.
First note that neither the graph with both a − b and c − d, nor the graph with neither a − b nor c − d, is an essential
graph. The first, because in order to protect the directed edges there must be two non-adjacent
vertices. The second, because then the undirected edges would contain a four-cycle and fail to be
chordal. There are two general approaches to get around this: either add more directed edges going
up to the independent graphs, or delete/add undirected edges. In order to protect all the directed
edges with both a − b and c − d present, an extra directed edge would have to be added going to
each of the vertices in the independent set. This requires at least n edges. Alternatively, one can
try to avoid breaking the chordal condition while manipulating undirected edges. This could be
done by either deleting an edge in the cycle formed by a, c, b, d or adding extra chords in the cycles
formed between the independent graphs and a, b, c, d. In order to delete the edge a − c, one first
has to delete n edges from a or c to the independent graph to avoid forming a four-cycle. This
takes order n moves. In adding chords to avoid making a cycle, a chord has to be added to each
vertex in an independent graph, so again order n new edges must be added. Together, this means
there is no path between these graphs in o(n) moves of this Markov chain.
Moreover, this example has 5n + 4 vertices and 12n + 5 edges, well inside the sparsity condition
He, Jia, and Yu consider, namely that the number of edges is at most a small constant multiple
of the number of vertices.
Acknowledgment. We thank Caroline Uhler and Liam Solus for several helpful discussions on the
topic of enumeration of Markov equivalence graphs.

References
[1] Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes
for acyclic digraphs. Ann. Statist., 25(2):505–541, April 1997.
[2] Roland Bacher. Valeur propre minimale du laplacien de Coxeter pour le groupe symétrique. J. Algebra,
167(2):460–472, 1994.
[3] David Maxwell Chickering. Learning equivalence classes of Bayesian-network structures. J. Mach. Learn. Res.,
2(3):445–498, 2002.
[4] Persi Diaconis and Laurent Saloff-Coste. Comparison techniques for random walk on finite groups. Ann. Probab.,
21(4):2131–2156, 1993.
[5] Reinhard Diestel. Graph theory, volume 173 of Graduate Texts in Mathematics. Springer, Heidelberg, fourth
edition, 2010.
[6] Nir Friedman, Michal Linial, Iftach Nachman, and Dana Pe’er. Using Bayesian networks to analyze expression
data. Journal of Computational Biology, 7:601–620, 2004.
[7] Steven B. Gillispie. Formulas for counting acyclic digraph Markov equivalence classes. J. Statist. Plann. Inference,
136(4):1410–1432, 2006.
[8] Steven B. Gillispie and Michael D. Perlman. The size distribution for Markov equivalence classes of acyclic
digraph models. Artificial Intelligence, 141(1-2):137–155, 2002.
[9] Y. He and B. Yu. Formulas for Counting the Sizes of Markov Equivalence Classes of Directed Acyclic Graphs.
ArXiv e-prints, October 2016.
[10] Yang-Bo He and Zhi Geng. Active learning of causal networks with intervention experiments and optimal designs.
J. Mach. Learn. Res., 9:2523–2547, 2008.
[11] Yangbo He, Jinzhu Jia, and Bin Yu. Reversible MCMC on Markov equivalence classes of sparse directed acyclic
graphs. Ann. Statist., 41(4):1742–1779, 2013.
[12] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of
knowledge and statistical data. Mach. Learn., 20(3):197–243, September 1995.
[13] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical
Society, Providence, RI, 2009. With a chapter by James G. Propp and David B. Wilson.
[14] Marloes H. Maathuis, Markus Kalisch, and Peter Bühlmann. Estimating high-dimensional intervention effects
from observational data. Ann. Statist., 37(6A):3133–3164, 2009.
[15] Neal Madras and Dana Randall. Markov chain decomposition for convergence rate analysis. Ann. Appl. Probab.,
12(2):581–606, 2002.
[16] A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov Equivalence Classes by Number of Immoralities.
ArXiv e-prints, November 2016.
[17] A. Radhakrishnan, L. Solus, and C. Uhler. Counting Markov Equivalence Classes for DAG models on Trees.
ArXiv e-prints, June 2017.
[18] R. W. Robinson. Counting unlabeled acyclic digraphs, pages 28–43. Springer Berlin Heidelberg, Berlin, Heidel-
berg, 1977.
[19] Bertran Steinsky. Enumeration of labelled chain graphs and labelled essential directed acyclic graphs. Discrete
Math., 270(1-3):267–278, 2003.
[20] Bertran Steinsky. Enumeration of labelled essential graphs. Ars Combin., 111:485–494, 2013.
