### Bayesian Networks

Overview
1. Introduction to Bayesian Networks
2. Inference
3. Learning Parameters
4. Learning Topology
5. Decision Making
BAYESIAN NETWORKS: AN INTRODUCTION

Overview
1. A Bayesian Network
2. Instantiation
3. Probability Flows
4. Bayesian Networks and Causality
5. Edges and Conditional Independence
Bayesian Networks: Introduction
A Bayesian network consists of two parts:
1. The qualitative part, which is given in the form of a directed acyclic graph (DAG).
   • Each node of the graph represents a variable of the system the Bayesian network is modeling.
   • The edges of the graph represent independence relations between the variables (more on this later).
2. The quantitative part, which is given by probability distributions associated with each variable, which is to say with each node in the graph. (From now on I will talk of variables and nodes interchangeably.)
   • These distributions give the probability that the associated variable takes a particular value given the values of its parent nodes in the graph.
[Figure: the running example network — six binary nodes with edges A→D, B→E, C→E, B→F, D→F, E→F, and the following conditional probability tables.]

Pr(A):  A=1 .7,  A=0 .3
Pr(B):  B=1 .2,  B=0 .8
Pr(C):  C=1 .6,  C=0 .4

Pr(D | A):
  A=1:  D=1 .3,  D=0 .7
  A=0:  D=1 .9,  D=0 .1

Pr(E | B,C):
  B=1, C=1:  E=1 .9,  E=0 .1
  B=1, C=0:  E=1 .8,  E=0 .2
  B=0, C=1:  E=1 .6,  E=0 .4
  B=0, C=0:  E=1 .1,  E=0 .9

Pr(F | B,D,E):
  1,1,1:  F=1 .5,  F=0 .5
  1,1,0:  F=1 .4,  F=0 .6
  1,0,1:  F=1 .7,  F=0 .3
  1,0,0:  F=1 .4,  F=0 .6
  0,1,1:  F=1 .3,  F=0 .7
  0,1,0:  F=1 .5,  F=0 .5
  0,0,1:  F=1 .6,  F=0 .4
  0,0,0:  F=1 .1,  F=0 .9
Instantiation
• We will talk about nodes being ‘instantiated’ when we know that they have a particular value, and uninstantiated when we do not.
• Let’s look at what occurs when a node is instantiated…
Probability Flows
• Downwards:
  Pr(D=1) = (.7)(.3) + (.3)(.9) = .21 + .27 = .48
  (using Pr(A=1)=.7 and D’s conditional probability table)
Probability Flows
• Downwards:
  Let A=1.
  Pr(D=1) = .3
Probability Flows
• Upwards:
  Pr(A=1) = .7
Probability Flows
• Upwards:
  Let D=1.
  Pr(A=1) = (.7)(.3) / ((.7)(.3) + (.3)(.9)) = .21 / (.21 + .27) = .4375
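These downward and upward flows can be checked with a few lines of Python. This is a sketch using only the A–D fragment of the example network (the variable names are mine; only A and D are needed, since the other nodes marginalize out):

```python
# Tables from the slides: Pr(A=1)=.7; Pr(D=1|A=1)=.3, Pr(D=1|A=0)=.9.
p_a1 = .7
p_d1_given_a = {1: .3, 0: .9}

# Downwards: marginalize over A's values.
p_d1 = p_a1 * p_d1_given_a[1] + (1 - p_a1) * p_d1_given_a[0]

# Upwards: Bayes' rule once D=1 is instantiated.
p_a1_given_d1 = p_a1 * p_d1_given_a[1] / p_d1
```

The upward flow is just Bayes’ rule: the prior Pr(A=1)=.7 drops to .4375 once D=1 is observed.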
Probability Flows
• Sideways!
• A priori:
  Pr(B=1) = .2
• If C is instantiated and E is not, this is unchanged:
  Let C=0.
  Pr(B=1) = .2
Probability Flows
• Sideways!
• But if E is instantiated, then knowing the value of C affects our knowledge of B…
Probability Flows
• Sideways!
• Let E=1.
  Pr(B=1)
  = ((.2)(.6)(.9) + (.2)(.4)(.8)) /
    ((.2)(.6)(.9) + (.2)(.4)(.8) + (.8)(.6)(.6) + (.8)(.4)(.1))
  = .172/.492
  = .350
Probability Flows
• Sideways!
• Let E=1, C=0.
  Pr(B=1) = (.2)(.4)(.8) / ((.2)(.4)(.8) + (.8)(.4)(.1))
  = .064/.096
  = .667
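The sideways calculations can be reproduced by brute-force enumeration over the B–C–E fragment (a sketch; the function and variable names are mine, and the tables are those given earlier):

```python
from itertools import product

# Priors and Pr(E=1 | B, C) from the example tables.
p_b1, p_c1 = .2, .6
p_e1 = {(1, 1): .9, (1, 0): .8, (0, 1): .6, (0, 0): .1}

def joint(b, c, e):
    """Pr(B=b, C=c, E=e) for the B-C-E fragment."""
    pb = p_b1 if b else 1 - p_b1
    pc = p_c1 if c else 1 - p_c1
    pe = p_e1[(b, c)] if e else 1 - p_e1[(b, c)]
    return pb * pc * pe

def pr_b1(c=None, e=None):
    """Pr(B=1 | the instantiated nodes) by enumerating the joint."""
    num = den = 0.0
    for b, cv, ev in product((0, 1), repeat=3):
        if (c is not None and cv != c) or (e is not None and ev != e):
            continue
        p = joint(b, cv, ev)
        den += p
        num += p if b == 1 else 0.0
    return num / den
```

With E=1 alone, Pr(B=1) rises from .2 to .172/.492 ≈ .35; additionally learning C=0 pushes it up to ≈ .667, since C=0 leaves B as the remaining explanation of E=1.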
Probability Flows
• What is going on?
• Sideways inference is akin to ‘explaining away’:
  Hypothesis 1: Mr N.N. suffered a stroke.
  Hypothesis 2: Mr N.N. suffered a heart attack.
  Event: Mr N.N. died.
Bayesian Networks and Causality
• So edges represent causation?
NO!!!
Bayesian networks are not (in general) causal maps.
– When creating a BN from expert knowledge, the network is often constructed from known causal connections, since humans tend to think in terms of causal relations.
– When learning a BN from data we cannot assume that an edge represents a causal relationship.
• There have been controversial methodologies suggested for reading causal relationships off non-causal Bayesian networks. Cf. Neapolitan.
Edges and Conditional Independence
• We said ‘The graph represents independence relations between the variables.’
• It does so through the edges, or, more accurately, through the ABSENCE of edges, between nodes.
– Recall that two variables, A and B, are independent if:
  P(A,B) = P(A)·P(B)
– And they are conditionally independent given a variable C if:
  P(A,B|C) = P(A|C)·P(B|C)
BAYESIAN NETWORKS: TECHNICALITIES AND DEFINITIONS

Overview
1. The Markov Condition
2. D-Separation
3. The Markov Blanket
4. Markov Equivalence
The Markov Condition
The Markov Condition:
A node in a Bayesian network is conditionally independent of its non-descendents given its parents.
Be careful:
– Does the sideways flow of probabilities clash with what you think the Markov Condition claims?
– A node is NOT conditionally independent of its non-descendents given its parents AND its descendents!
The Markov Condition
• We can think of a Bayesian Network as ‘pulling apart’ a
Joint Probability Distribution by its conditional
independencies, and thereby rendering it tractable.
• This permits us to use probability theory to reason
about systems in a tractable way.
– Imagine each variable in our six node network can take ten
values. The conditional probability tables would then have
a total of 11,130 values. (10,000 of which would be for
node F).
– The full joint probability table would have 1,000,000
values.
BUT THEY REPRESENT THE SAME DISTRIBUTION
The Markov Condition
A BN can do this only because it meets the Markov Condition.
In fact, meeting this condition is the formal definition of a Bayesian network:
A DAG G and a probability distribution P form a Bayesian network if and only if the pair <G,P> together satisfy the Markov Condition.
The Markov Condition also entails further conditional independencies…
D-Separation
Some definitions:
Where we have a set of nodes {X1, X2, …, Xk}, where k ≥ 2, such that Xi → Xi−1 or Xi−1 → Xi for 2 ≤ i ≤ k, we call the set of edges connecting these nodes a chain between X1 and Xk.
Let the head of an edge be the side next to the child (where the arrow is on our graphs) and the tail be the side next to the parent. We will talk of the edges of a chain meeting at a node on the chain.
D-Separation
Some definitions:
Let A be a set of nodes, and X and Y be distinct nodes not in A, and c be a chain between X and Y. Then c is blocked by A if one of the following holds:
1. There is a node Z ∈ A on c, and the edges that meet at Z on c meet head-to-tail.
2. There is a node Z ∈ A on c, and the edges that meet at Z on c meet tail-to-tail.
3. There is a node Z on c, such that Z and all of Z’s descendents are not in A, and the edges that meet at Z on c meet head-to-head.
D-Separation
Let A be a set of nodes, and X and Y be distinct
nodes not in A. X and Y are d-separated by A if
and only if every chain between X and Y is
blocked by A.
(This can be generalized: Let G = (V, E) be a DAG, and A, B, and C be
mutually disjoint subsets of V. We say A and B are d-separated by C in G if
and only if, for every X ∈ A and Y ∈ B, X and Y are d-separated by C.)
D-Separation
1) The Markov condition entails that all d-separations are conditional independencies;
2) Every conditional independence entailed by the Markov condition is identified by a d-separation.
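The blocking definition is mechanical enough to test in code. Below is a sketch of a d-separation checker for the running example network (the edge list and function names are mine); it enumerates every chain between X and Y and applies the three blocking rules:

```python
# Edges of the example DAG: A->D; B->E; C->E; B->F; D->F; E->F.
EDGES = {('A', 'D'), ('B', 'E'), ('C', 'E'), ('B', 'F'), ('D', 'F'), ('E', 'F')}

def descendants(node):
    """Every node reachable from `node` along directed edges."""
    found, frontier = set(), [node]
    while frontier:
        n = frontier.pop()
        for parent, child in EDGES:
            if parent == n and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def neighbours(node):
    return ([c for p, c in EDGES if p == node] +
            [p for p, c in EDGES if c == node])

def chains(x, y):
    """All simple undirected paths (chains) between x and y."""
    paths, stack = [], [[x]]
    while stack:
        path = stack.pop()
        for n in neighbours(path[-1]):
            if n == y:
                paths.append(path + [n])
            elif n not in path:
                stack.append(path + [n])
    return paths

def blocked(path, a):
    """Does one of the three blocking rules hold at a node interior to the chain?"""
    for i in range(1, len(path) - 1):
        prev, z, nxt = path[i - 1], path[i], path[i + 1]
        head_to_head = (prev, z) in EDGES and (nxt, z) in EDGES
        if not head_to_head and z in a:
            return True   # rules 1 and 2: a head-to-tail or tail-to-tail node in A
        if head_to_head and z not in a and not descendants(z) & a:
            return True   # rule 3: a head-to-head node with itself and its descendants outside A
    return False

def d_separated(x, y, a):
    return all(blocked(p, a) for p in chains(x, y))
```

For instance, B and C are d-separated by the empty set (their connecting chains are blocked at head-to-head meetings), but instantiating E d-connects them — the sideways flow again.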
The Markov Blanket
A node is conditionally independent of every other node in the graph given its parents, its children, and the other parents of its children. These form the Markov Blanket of the node.
• Why the other parents of the node’s children?
• Because of the sideways flow of probabilities.
Markov Equivalence
A definition:
If edges from nodes A and B meet at a node C, we say that this meeting is coupled if and only if there is also an edge between nodes A and B. Otherwise this meeting is uncoupled.
Two DAGs are Markov Equivalent:
⇔ they encode the same conditional independencies;
⇔ they entail the same d-separations;
⇔ they have the same links (edges without regard for direction) and the same set of uncoupled head-to-head meetings.
INFERENCE 1: THE VARIABLE ELIMINATION ALGORITHM
Potentials
We define a potential to be a function mapping value combinations of a set of variables to the non-negative real numbers. For example, f is a potential:

f(A,D):
  a=1, d=1:  .3
  a=1, d=0:  .7
  a=0, d=1:  .9
  a=0, d=0:  .1

Notice this is the conditional probability table of node D.
• Conditional probability tables are potentials.
• Joint probability tables are potentials.
Potentials
• A potential’s ‘input’ variables (those variables which can take more than one value) are called its scheme; here Scheme(f) = {A,D}.
• So, if we know D is true, then the potential corresponding to our a posteriori knowledge from D’s conditional probability table would be:

g(A) = f(A, D=1):
  a=1:  .3
  a=0:  .9

Scheme(g) = {A}
(Notice that a potential need not sum to one.)
Multiplication of Potentials
We next define two operations on potentials: multiplication and marginalization.
Given a set of potentials, F, the multiplication of these potentials is itself a potential. The value of each row, r, of a potential formed in this manner is obtained from the product of the row of each function f in F which assigns the variables of Scheme(f) the same values as they are assigned in r:

  h(x) = ∏_{f ∈ F} f(x | Scheme(f))

Obviously:

  Scheme(h) = ∪_{f ∈ F} Scheme(f)

If f and g are potentials, when we perform the calculation/assignment f = f·g (possible only when Scheme(g) ⊆ Scheme(f)) we will say we multiply g into f.
Multiplication of Potentials

f(A,D):
  a=1, d=1:  .3
  a=1, d=0:  .7
  a=0, d=1:  .9
  a=0, d=0:  .1

g(A,X):
  a=1, x=1:  .7
  a=1, x=0:  .3
  a=0, x=1:  .5
  a=0, x=0:  .5

If h = f·g then:

h(A,D,X):
  a=1, d=1, x=1:  .3 × .7 = .21
  a=1, d=1, x=0:  .3 × .3 = .09
  a=1, d=0, x=1:  .7 × .7 = .49
  a=1, d=0, x=0:  .7 × .3 = .21
  a=0, d=1, x=1:  .9 × .5 = .45
  a=0, d=1, x=0:  .9 × .5 = .45
  a=0, d=0, x=1:  .1 × .5 = .05
  a=0, d=0, x=0:  .1 × .5 = .05
Marginalization of Potentials
Given a potential f, we can marginalize out a set of variables, A, from f, and the result is itself a potential. If i is such a potential, then:

  Scheme(i) = Scheme(f) − A

Each row, r, in i is computed by summing the rows of f where the variables in Scheme(i) have the values assigned to them by r.
If we wish to assign to a function variable the result of marginalizing out one of its variables, we will simply say we marginalize out from the function variable. So if g is the potential that results from marginalizing out the variable D from our example potential f, then:

g(A):
  a=1:  .3 + .7 = 1
  a=0:  .9 + .1 = 1
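Both operations are easy to prototype. The sketch below (class and method names are mine; it assumes binary variables for brevity) reproduces the f, g and h tables above:

```python
from itertools import product

class Potential:
    """A potential: maps value combinations of its scheme to non-negative reals."""
    def __init__(self, scheme, table):
        self.scheme = tuple(scheme)   # ordered variable names
        self.table = dict(table)      # {(v1, v2, ...): value}

    def __mul__(self, other):
        """Pointwise multiplication; the result's scheme is the union of schemes."""
        scheme = self.scheme + tuple(v for v in other.scheme if v not in self.scheme)
        table = {}
        for vals in product((0, 1), repeat=len(scheme)):
            row = dict(zip(scheme, vals))
            table[vals] = (self.table[tuple(row[v] for v in self.scheme)]
                           * other.table[tuple(row[v] for v in other.scheme)])
        return Potential(scheme, table)

    def marginalize_out(self, var):
        """Sum the potential over all values of `var`."""
        scheme = tuple(v for v in self.scheme if v != var)
        table = {}
        for vals, p in self.table.items():
            key = tuple(v for v, name in zip(vals, self.scheme) if name != var)
            table[key] = table.get(key, 0.0) + p
        return Potential(scheme, table)

# f = Pr(D | A) and g(A, X) as in the example tables.
f = Potential(('A', 'D'), {(1, 1): .3, (1, 0): .7, (0, 1): .9, (0, 0): .1})
g = Potential(('A', 'X'), {(1, 1): .7, (1, 0): .3, (0, 1): .5, (0, 0): .5})
h = f * g
```

Marginalizing D out of f gives the all-ones potential over A, exactly as in the worked table.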
The Variable Elimination Algorithm
1. Perform a topological sort (ancestral ordering) on the graph. This will provide us with an ordering in which no node appears before any of its ancestors. This is always possible since the graph is a DAG.
2. Construct a set of 'buckets', where there is one bucket associated with each variable in the network, b(i), and one additional bucket b∅. Each bucket will hold a set of potentials (or constant functions in the case of b∅). The buckets are ordered according to the topological order obtained in step one, with b∅ at the beginning.
3. Convert the conditional probability tables of the network into potentials and place each in the bucket associated with the largest variable in its scheme, based on the ordering. If there are no variables in a potential's scheme, it is placed in the null bucket.
4. Proceed in reverse order through the buckets:
   i. Multiply all potentials in the bucket, producing a new potential, p.
   ii. Marginalize out the variable associated with the bucket from p, producing the potential p′.
   iii. Place p′ in the bucket associated with the largest variable in its scheme.
5. Process the null bucket, which involves simply joining constant functions, and is simply scalar multiplication.
The Variable Elimination Algorithm
Initial contents (from step 3), with bucket order Ø, A, B, C, D, E, F:
  Bucket A: node A’s CPT, scheme {A}
  Bucket B: node B’s CPT, scheme {B}
  Bucket C: node C’s CPT, scheme {C}
  Bucket D: node D’s CPT, scheme {A,D}
  Bucket E: node E’s CPT, scheme {B,C,E}
  Bucket F: node F’s CPT, scheme {B,D,E,F}
Processing in reverse order:
  1. Bucket F: (multiply all and) marginalize out F; the result, scheme {B,D,E}, goes to Bucket E.
  2. Bucket E: multiply all and marginalize out E; the result, scheme {B,C,D}, goes to Bucket D.
  3. Bucket D: multiply all and marginalize out D; the result, scheme {A,B,C}, goes to Bucket C.
  4. Bucket C: multiply all and marginalize out C; the result, scheme {A,B}, goes to Bucket B.
  5. Bucket B: multiply all and marginalize out B; the result, scheme {A}, goes to Bucket A.
  6. Bucket A: multiply all and marginalize out A; the result, scheme {}, goes to Bucket Ø.
  7. Multiply all in Bucket Ø, and we have the result.
Points to note
The algorithm produces the probability of the
evidence. So if it is run without any evidence, it simply
marginalizes all variables out and returns 1!
To actually get, say, the A Priori probabilities of each
variable, we have to run the algorithm repeatedly,
assigning each value to each variable (one at a time).
Likewise to get A Posteriori probabilities, we run the
algorithm on the evidence, then with the evidence and
the Variable-Value we are interested in, and divide the
second result by the first.
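The ‘run twice and divide’ recipe can be illustrated with a brute-force stand-in for variable elimination: summing the joint over every assignment consistent with the evidence. This sketch uses the example network’s tables; F’s parent ordering (B, D, E) is my assumption from the tables given earlier, and the names are mine:

```python
from itertools import product

# CPTs of the full six-node example network (Pr(node=1 | parent values)).
P = {
    'A': {(): .7}, 'B': {(): .2}, 'C': {(): .6},
    'D': {(1,): .3, (0,): .9},
    'E': {(1, 1): .9, (1, 0): .8, (0, 1): .6, (0, 0): .1},
    'F': {(1, 1, 1): .5, (1, 1, 0): .4, (1, 0, 1): .7, (1, 0, 0): .4,
          (0, 1, 1): .3, (0, 1, 0): .5, (0, 0, 1): .6, (0, 0, 0): .1},
}
PARENTS = {'A': (), 'B': (), 'C': (), 'D': ('A',),
           'E': ('B', 'C'), 'F': ('B', 'D', 'E')}
NODES = list(PARENTS)

def prob_of_evidence(evidence):
    """Sum the joint over every assignment consistent with the evidence."""
    total = 0.0
    for vals in product((0, 1), repeat=len(NODES)):
        x = dict(zip(NODES, vals))
        if any(x[k] != v for k, v in evidence.items()):
            continue
        p = 1.0
        for n in NODES:
            p1 = P[n][tuple(x[q] for q in PARENTS[n])]
            p *= p1 if x[n] == 1 else 1 - p1
        total += p
    return total

# An a posteriori probability comes from two runs and a division:
p_a1_given_d1 = prob_of_evidence({'A': 1, 'D': 1}) / prob_of_evidence({'D': 1})
```

With no evidence the sum is 1, as noted above; dividing Pr(A=1, D=1) by Pr(D=1) recovers the a posteriori value .4375 computed earlier.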
Conclusion
The Variable Elimination Algorithm is…
• VERY INEFFICIENT!!!
• (When supplemented) the only algorithm that
can estimate error bars for arbitrary nodes in a
Bayesian Network.
- once we have completed the algorithm as given, we
proceed backwards calculating the derivatives of the
functions involved. From these we can produce an
effective approximation of the variance of the
probability distribution, from which we estimate the
error bars.
INFERENCE 2: THE JUNCTION TREE ALGORITHM
Junction Trees
• A Junction Tree is a secondary structure that we construct from a Bayesian network:
1. Take a copy of the DAG and undirect the edges.
2. Moralize.
3. Triangulate.
4. Find the cliques of the triangulated graph. (Steps 3 and 4 are performed in a single step.)
5. Insert sepsets between the cliques.
1.
2.
3.
Begin with a set of n trees, each consisting of a single clique, and
an empty set S.
For each distinct pair of cliques X and Y, insert a candidate sepset
into S, containing all and only nodes in both X and Y.
Repeat until n-1 sepsets have been inserted into the forest.
A.
B.
C.
Choose the candidate sepset, C, which contains the largest number
of nodes, breaking ties by choosing the sepset which has the smaller
value product (the product of the number of values of the
nodes/variables in the sepset).
Delete C from S.
Insert C between the cliques X and Y only if X and Y are on different
trees in the forest. (NB This merges the two trees into a larger tree.)
Our DAG
[Figure: the example DAG — A→D; B→E; C→E; B→F; D→F; E→F.]
Undirected
[Figure: the same graph with the edge directions removed.]
Moralized
[Figure: the undirected graph with each node’s parents ‘married’ — edges added between B and C (parents of E) and among B, D and E (parents of F).]
Obtain Cliques from Triangulated Graph whilst Triangulating
1. Take the moral graph, G1, and make a copy of it, G2.
2. While there are still nodes left in G2:
   A. Select a node V from G2, such that V causes the least number of edges to be added in step 2B, breaking ties by choosing the node that induces the cluster with the smallest weight, where:
      • the weight of a node V is the number of values of V;
      • the weight of a cluster is the product of the weights of its constituent nodes.
   B. The node V and its neighbours in G2 form a cluster, C. Connect all of the nodes in this cluster.
   C. If C is not a sub-graph of a previously stored cluster, store C.
   D. Remove V from G2.
Cliques
{A,D}, {B,D,E,F}, {B,C,E}
Creating Separation Sets
1. Create n trees, each consisting of a single clique, and an empty set S.
2. For each distinct pair of cliques X and Y, insert X ∩ Y into S, recording the cliques this set was formed from.
3. Repeat until n−1 sepsets have been inserted into the forest:
   a. Select from S the sepset, s, that has the largest number of variables in it, breaking ties by choosing the set which has the lower value product (the product of the number of values of each variable in the set). Further ties can be broken arbitrarily.
   b. Delete s from S.
   c. Insert s between the cliques X and Y only if X and Y are on different trees in the forest. (Note this merges these two trees into a larger tree.)
Cliques and Sepsets
{A,D} —[D]— {B,D,E,F} —[B,E]— {B,C,E}
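The sepset-insertion procedure is a Kruskal-style spanning-tree construction. A sketch (function and variable names are mine) that reproduces the sepsets {D} and {B,E} for the three cliques above:

```python
from itertools import combinations

def build_sepsets(cliques, n_values):
    """Insert n-1 sepsets: largest intersection first, ties by smaller value product."""
    candidates = []
    for x, y in combinations(range(len(cliques)), 2):
        s = cliques[x] & cliques[y]
        mass = 1
        for v in s:
            mass *= n_values[v]                 # value product of the sepset
        candidates.append((x, y, s, mass))
    candidates.sort(key=lambda c: (-len(c[2]), c[3]))

    parent = list(range(len(cliques)))          # union-find over the forest of trees
    def root(i):
        while parent[i] != i:
            i = parent[i]
        return i

    chosen = []
    for x, y, s, _ in candidates:
        if len(chosen) == len(cliques) - 1:
            break
        rx, ry = root(x), root(y)
        if rx != ry:                            # insert only if X and Y are on different trees
            parent[rx] = ry
            chosen.append((cliques[x], s, cliques[y]))
    return chosen

CLIQUES = [{'A', 'D'}, {'B', 'D', 'E', 'F'}, {'B', 'C', 'E'}]
links = build_sepsets(CLIQUES, {v: 2 for v in 'ABCDEF'})
```

Here {B,E} is chosen first (two nodes), then {D}; the empty candidate sepset from the pair {A,D}, {B,C,E} is never needed.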
Junction Trees and Sub-Graphs
• Junction Trees can be formed from subgraphs:
Find the smallest sub-graph d-separated from
the remainder of the graph by instantiated
nodes, that includes all nodes we wish to
predict a posteriori values for. Construct the
JT from this sub-graph. (Store created JTs for
reuse.)
A Message Pass
Passing a message from clique X to clique Y via sepset S:
1. Save the potential associated with S.
2. Marginalize a new potential for S out of X.
3. Assign a new potential to Y, such that:

  pot(Y)_new = pot(Y)_old · (pot(S)_new / pot(S)_old)
Collect and Disperse Evidence
COLLECT-EVIDENCE(X)
1. Mark X.
2. Call Collect-Evidence recursively on X's unmarked neighboring clusters, if any.
3. Pass a message from X to the cluster which invoked Collect-Evidence(X).
DISTRIBUTE-EVIDENCE(X)
1. Mark X.
2. Pass a message from X to each of its unmarked neighboring clusters, if any.
3. Call Distribute-Evidence recursively on X's unmarked neighboring clusters, if any.
Evidence Potentials

  Variable   Value 1   Value 2   Value 3
  A          1         1         1
  B          1         1         1
  C          1         0         0
  D          1         1         1
  E          .7        .25       .05
  F          1         1         0

We know:
• Variable C has value 1.
• Variable F has value 1 or 2, but not 3.
• Variable E has a 70% chance of being value 1, a 25% chance of being value 2 and a 5% chance of being value 3. (Soft evidence.)
The Junction Tree Algorithm
1. Initialize Junction Tree
   – Associate with each clique and sepset a potential with all values set to 1.
2. Multiply in Conditional Probabilities
   – For each node, find a clique containing the node and its parents (it will exist) and multiply the node’s conditional probability table into the clique’s potential.
3. Multiply in Evidence
   – For each node, multiply any evidence that exists into a clique where the node is present.
4. Collect and Disperse Evidence
   – Pick an arbitrary root clique, and call collect-evidence and then disperse-evidence on this clique.
5. Marginalize Out Desired Values
   – For each node that you wish to obtain a posteriori probabilities for, select the smallest clique containing it and marginalize the clique’s potential down to that node.
6. Normalize
   – Normalize the potentials obtained from step 5.
NB: If using soft evidence, steps 3 and 4 must be repeated until a desired level of convergence is reached.
The Junction Tree Algorithm
• Very efficient: Calculates all a posteriori
probabilities at once.
• Numerous efficiency enhancements (see
literature).
• Complexity dominated by largest clique.
• Should be algorithm of choice unless seeking
error bars.
INFERENCE 3: SAMPLING
Logic Sampling
Given a set of variables, E, whose values, e, we know (or are assuming), estimate a posteriori probabilities for the other variables, U, in the network:
1. Perform a topological sort on the graph.
2. For each node in U, create a score card, with a number for each value. Initially set these to 0.
3. Repeat:
   • Randomly generate values for each variable from their conditional probability tables in the order generated in step 1.
   • If the values generated for the variables in E do not match e, discard the sample.
   • Otherwise, for each node in U, add 1 to the score for the value it has taken in this sample.
4. For each variable in U, normalize its scorecard to obtain a posteriori probabilities.
Logic Sampling
Problem:
We end up discarding too many cases when the
evidence has low probability (which is routinely
the case when dealing with large sets of evidence
variables).
DO NOT USE LOGIC SAMPLING
Likelihood Sampling
Given a set of variables, E, whose values, e, we know (or are assuming), estimate a posteriori probabilities for the other variables, U, in the network:
1. Perform a topological sort on the graph.
2. Set all nodes in E to their known values.
3. For each node in U, create a score card, with a number for each value. Initially set these to 0.
4. Repeat:
   • Randomly generate a value for each node in U from its conditional probability table, in the order generated in step 1.
   • Given the values assigned, calculate the probability, p, that E=e from these nodes’ respective conditional probability tables.
   • For each node in U, add p to the score for the value it has taken in this sample.
5. For each variable in U, sum the results and normalize.
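The procedure can be sketched on the A–D fragment of the example network, with D=1 as evidence (function and variable names are mine):

```python
import random

random.seed(0)

# A-D fragment again: Pr(A=1)=.7; Pr(D=1|A=1)=.3, Pr(D=1|A=0)=.9.
def likelihood_weighting(evidence_d, n_samples=200_000):
    scores = {0: 0.0, 1: 0.0}                      # score card for A
    for _ in range(n_samples):
        a = 1 if random.random() < .7 else 0       # sample A from its CPT
        p_d1 = .3 if a == 1 else .9
        w = p_d1 if evidence_d == 1 else 1 - p_d1  # p = Pr(E = e | sampled values)
        scores[a] += w                             # add p, not 1
    total = scores[0] + scores[1]
    return {value: s / total for value, s in scores.items()}

post = likelihood_weighting(evidence_d=1)
```

No samples are discarded; the estimate converges on the exact value Pr(A=1|D=1) = .4375 computed earlier.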
LEARNING 1: PARAMETER LEARNING

Dirichlet Probability Distributions
Definition: the Dirichlet probability distribution with parameters a1 … an, where

  N = Σ_{m=1}^{n} a_m

is:

  ρ(f1, f2, …, f_{n−1}) = ( Γ(N) / Π_{m=1}^{n} Γ(a_m) ) · f1^{a1−1} · f2^{a2−1} · … · fn^{an−1}

And:

  0 ≤ f_m ≤ 1,  Σ_m f_m = 1

Wonkish aside… The density is written as a function of only n−1 of the f’s, since the final one is uniquely determined by those that came before it:

  fn = 1 − Σ_{m=1}^{n−1} f_m
Dirichlet Probability Distributions
• Don’t worry:
– If the probability of a random variable, X, taking particular values from the set {v1, v2, … vn} is given by a Dirichlet distribution with parameters a1, a2, … an, and N = a1 + a2 + … + an, then:

  Pr(X = v_i) = a_i / N

– It is often said that Dirichlet distributions represent the probabilities associated with seeing value v_i occur a_i out of N times.
Dirichlet Probability Distributions
• Let the binary variable X be represented by the Dirichlet distribution Dir(4,6):
  Pr(X=v1) = .4
  Pr(X=v2) = .6
• Likewise, let the binary variable Y be represented by the Dirichlet distribution Dir(40,60):
  Pr(Y=v1) = .4
  Pr(Y=v2) = .6
Dirichlet Probability Distributions
• However, our confidence in the probabilities
given for Y would be much higher than those
given for X (since so much more of the
distribution lies in the vicinity of these values).
• We shall also see that, for our purposes, the
probabilities for Y would be much more
resistant to emendation from new evidence
than those for X.
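The point about resistance to emendation can be made concrete: give X and Y ten fresh observations of v1 each and compare how far their means move (a sketch; the function names are mine):

```python
def dirichlet_mean(params):
    """Pr(X = value m) = a_m / N for Dir(a_1, ..., a_n)."""
    n = sum(params)
    return [a / n for a in params]

def update(params, counts):
    """Posterior parameters after observing value m counts[m] times."""
    return [a + c for a, c in zip(params, counts)]

x2 = update([4, 6], [10, 0])     # Dir(4,6) after ten observations of v1
y2 = update([40, 60], [10, 0])   # Dir(40,60) after the same evidence
```

Ten counter-observations move Dir(4,6) from a mean of .4 all the way to .7, but barely shift Dir(40,60) (to ≈ .45).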
Learning Dirichlet Distributions
• Imagine we have a network topology A → B and wish to learn the parameters (conditional probability distributions) associated with each variable from the data set D:

Data Set D (observations of (A,B)):
  A  B
  1  1
  1  0
  0  1
  1  0
  0  1

The entries of Pr(A) and Pr(B|A) are as yet unknown.
Learning Dirichlet Distributions
• The basic procedure is:
1. Create a Dirichlet Distribution for each row of
each conditional probability table.
2. For each datum, for each node, find the row
in the conditional probability table which
corresponds to the values the node’s parents
take in the given datum, and add 1 to the
parameter associated with the value the
node takes in the datum.
Learning Dirichlet Distributions
• So we have 3 Dirichlet distributions (let us just show the parameters, which will initially be zero):
  – Node A: Dir(0,0)
  – Node B, row A=0: Dir(0,0)
  – Node B, row A=1: Dir(0,0)
Learning Dirichlet Distributions
• And we add the counts from the data set:
  – Node A: Dir(3,2), giving Pr(A=1)=.6, Pr(A=0)=.4
  – Node B, row A=0: Dir(0,2), giving Pr(B=1|A=0)=1, Pr(B=0|A=0)=0
  – Node B, row A=1: Dir(2,1), giving Pr(B=0|A=1)=.66, Pr(B=1|A=1)=.33
Learning Dirichlet Distributions
• But... are we really willing to conclude that it is 100% certain that B=1 whenever A=0, from just two observations?!?
NO
Learning Dirichlet Distributions
• To avoid such issues we do not commence with Dirichlets having only zeros. Rather, if, prior to observing any data, we assume that all values are equally likely, we might assign the a priori distributions:
  – Node A: Dir(1,1)
  – Node B, row A=0: Dir(1,1)
  – Node B, row A=1: Dir(1,1)
Learning Dirichlet Distributions
• And we add the counts from the data set:
  – Node A: Dir(1+3, 1+2), giving Pr(A=1)=.57, Pr(A=0)=.43
  – Node B, row A=0: Dir(1+0, 1+2), giving Pr(B=0|A=0)=.25, Pr(B=1|A=0)=.75
  – Node B, row A=1: Dir(1+2, 1+1), giving Pr(B=0|A=1)=.6, Pr(B=1|A=1)=.4
Learning Dirichlet Distributions
• With these prior distributions we can:
– Ensure some reasonable conservatism!
– Encode defeasible domain knowledge.
– Set the responsiveness of our learning algorithm to data. (Large values in the parameters will be less responsive.)
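The whole worked example — Dir(1,1) priors plus the five observations — fits in a short sketch (function and variable names are mine):

```python
def learn_cpts(data, prior=1):
    """Learn Pr(A) and Pr(B|A) for the network A -> B from (a, b) pairs,
    starting every Dirichlet at Dir(prior, prior)."""
    a_params = {0: prior, 1: prior}
    b_params = {0: {0: prior, 1: prior}, 1: {0: prior, 1: prior}}
    for a, b in data:
        a_params[a] += 1       # update A's Dirichlet
        b_params[a][b] += 1    # update the CPT row selected by the parent's value
    pr_a = {v: p / sum(a_params.values()) for v, p in a_params.items()}
    pr_b = {a: {v: p / sum(row.values()) for v, p in row.items()}
            for a, row in b_params.items()}
    return pr_a, pr_b

DATA = [(1, 1), (1, 0), (0, 1), (1, 0), (0, 1)]
pr_a, pr_b = learn_cpts(DATA)
```

This reproduces the tables above: Pr(A=1) ≈ .57, and Pr(B=1|A=0) = .75 rather than the overconfident 1.0.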
LEARNING 2: LEARNING STRUCTURE/TOPOLOGY
Topology Learning
• Given our algorithm for learning parameters
from Data we can learn topology:
Find the graph topology which, when its
parameters are learnt from the data, renders
the data most likely.
The Bayesian Scoring Criterion

  Pr(d | G) = Π_{i=1}^{n} Π_{j=1}^{q_i^(G)} [ Γ(N_ij^(G)) / Γ(N_ij^(G) + M_ij^(G)) ] · Π_{k=1}^{r_i} [ Γ(a_ijk^(G) + s_ijk^(G)) / Γ(a_ijk^(G)) ]

where:
• d is our data;
• G is the graph we are scoring;
• n is the number of nodes in the graph;
• q_i^(G) is the number of parent value combinations node i has in its conditional probability table given G;
• N_ij^(G) is the sum of the Dirichlet prior parameters for row j of node i’s conditional probability table;
• M_ij^(G) is the sum of the learnt additions to the Dirichlet parameters for the same row;
• r_i is the number of values node i has;
• a_ijk^(G) is the Dirichlet prior parameter corresponding to value k in row j for node i in graph G;
• s_ijk^(G) is the sum of the learnt additions to the same parameter;
• Γ is the Gamma function.
The Bayesian Scoring Criterion
Overview in words:
The probability of the data given the graph is the product, over every node and every row of that node’s conditional probability table, of the probability that the node would take the values it did when its parents took the values they did, given our prior parameters and the data encountered thus far.
The Bayesian Scoring Criterion
The criterion is:
• Decomposable: it is the product of a score for each node in the graph.
• Locally updateable: if we change a graph by adding, removing or reversing an edge, we need only rescore the nodes that were, or are now, children of the edge in question.
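A sketch of the criterion for binary variables, using log-Gamma for numerical stability (the function names and the two candidate graphs are mine; the data is the five-row set from the parameter-learning section, with Dir(1,1) priors):

```python
from math import lgamma, exp
from collections import defaultdict

def bd_score(data, parents, prior=1):
    """Pr(d | G) for binary variables: for each node and each CPT row,
    Gamma(N)/Gamma(N+M) * prod_k Gamma(a_k + s_k)/Gamma(a_k),
    with every prior parameter a_k set to `prior`."""
    log_score = 0.0
    for node, pa in parents.items():
        rows = defaultdict(lambda: defaultdict(int))  # parent config -> value counts
        for datum in data:
            rows[tuple(datum[p] for p in pa)][datum[node]] += 1
        for counts in rows.values():   # rows never seen contribute a factor of 1
            n_prior = 2 * prior        # N: sum of the row's prior parameters
            m = sum(counts.values())   # M: learnt additions to the row
            log_score += lgamma(n_prior) - lgamma(n_prior + m)
            for v in (0, 1):
                log_score += lgamma(prior + counts[v]) - lgamma(prior)
    return exp(log_score)

DATA = [{'A': 1, 'B': 1}, {'A': 1, 'B': 0}, {'A': 0, 'B': 1},
        {'A': 1, 'B': 0}, {'A': 0, 'B': 1}]
score_ab = bd_score(DATA, {'A': (), 'B': ('A',)})    # graph A -> B
score_indep = bd_score(DATA, {'A': (), 'B': ()})     # graph with no edge
```

Here the A → B graph scores higher than the empty graph, reflecting the dependence of B on A visible in the data.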
Strategy: Greedy Hill Climb
• Traverse the state space of graph topologies via a number of operations on a graph (e.g. insert, remove, reverse an edge).
• At any point, calculate the alteration to the graph’s score each of these moves would make via local scoring of the nodes affected, and choose the best, so long as it is an improvement.
• Stop when no improvements are possible.
• Use multiple restarts to try to avoid local maxima.
(We could also use simulated annealing.)
Problems
(Simulated Annealing has the same problems.)
1. We would like (Markov) equivalence classes
of DAGs to get the same score, since they
represent the same conditional
independencies. Given our criterion, this will
not, in general, occur.
Problems
2. We would like each (Markov) equivalence class of DAGs to have the same a priori probability of being that which is chosen. Given our state space (all possible DAGs) and search method (which can get stuck at local maxima) this is not the case, since some equivalence classes have massively more instances than others.
Problems
3. A naïve implementation of the equation given is complex: each node scored would require updating the Dirichlets with the information from the data (linear in the size of the data set) and then multiplying the Dirichlets (exponential in the number of parents of the node).
Solution 1: Equivalent Sample Size
• We can ensure that (Markov) equivalent DAGs
have the same score by using an equivalent
sample size when we set up our Dirichlet
priors.
• The sum of the (initial – see solution three)
prior parameters of all Dirichlet distributions
associated with a node must equal a particular
number.
Solution 1: Equivalent Sample Size
Non-equivalent sample size:
– For node A: 1+1 = 2
– For node B: 1+1+1+1 = 4
(A: Dir(1,1); B, row A=0: Dir(1,1); B, row A=1: Dir(1,1))
Solution 1: Equivalent Sample Size
Equivalent sample size of 4:
– For node A: 2+2 = 4
– For node B: 1+1+1+1 = 4
(A: Dir(2,2); B, row A=0: Dir(1,1); B, row A=1: Dir(1,1))
Solution 1: Equivalent Sample Size
Fresh concerns?
– Big ESSs lead to the parameters of nodes with no or few parents being resistant to emendation from evidence.
– Small ESSs lead to nodes with many parents having conditional probability rows that occur only once or twice.
It is common to use small sizes – e.g. the number of values of the node with the most values.
Solution 2: DAG Patterns
• We ensure that (Markov) equivalence classes of DAGs have the same a priori probability of being that which is chosen by searching a state space of DAG equivalence classes rather than DAGs.
• This is done by using DAG patterns, formed of non-reversible and reversible edges.
• The operations commonly used for traversing the state space of DAG equivalence classes do not permit the use of simulated annealing.
Solution 3: Tractable calculations
• The naïve implementation of the scoring criterion can be improved. We note:
1. The result of the calculation is the same if we add evidence one at a time or all at once:

  (Γ(2)/Γ(2+2)) · (Γ(1+2)/Γ(1)) · (Γ(1+0)/Γ(1)) = 2/6

  ((Γ(2)/Γ(2+1)) · (Γ(1+1)/Γ(1)) · (Γ(1+0)/Γ(1))) ·
  ((Γ(3)/Γ(3+1)) · (Γ(2+1)/Γ(2)) · (Γ(1+0)/Γ(1))) = (1/2)(4/6) = 2/6
Solution 3: Tractable calculations
• The naïve implementation of the scoring criterion can be improved. We note:
2. Each Dirichlet distribution that includes only the (original) prior parameters has a value of 1:

  Π_{j=1}^{q_i^(G)} [ Γ(N_ij^(G)) / Γ(N_ij^(G)) ] · Π_{k=1}^{r_i} [ Γ(a_ijk^(G)) / Γ(a_ijk^(G)) ] = 1
Solution 3: Tractable calculations
• The naïve implementation of the scoring criterion can be improved:
1. Set the node’s score to 1.
2. For each datum, add the datum to the relevant Dirichlet, and multiply the score by:

  [ Γ(N_ij^(G)) / Γ(N_ij^(G) + M_ij^(G)) ] · Π_{k=1}^{r_i} [ Γ(a_ijk^(G) + s_ijk^(G)) / Γ(a_ijk^(G)) ]

where, if this parent combination has been encountered before, N and each a is now the updated prior.
Complexity is linear in the size of the data set.