Representation of Undirected Graph Models

Koller/Friedman’s book Probabilistic Graphical Models: Principles and Techniques [1] (PGM) is the most exhaustive book I have found on PGMs; however, it contains many definitions, theorems, and proofs that make it hard to understand. When I read chapter 4 of this book, I met two questions: one is why the potential functions are defined on cliques, the other is why d-separation works. For the first question, I didn’t find a good answer in Koller/Friedman’s book, but Michael I. Jordan’s unfinished book An Introduction to Probabilistic Graphical Models [^4] gives some intuition for why we do it that way. For the second one, Koller/Friedman’s book defines d-separation in definition 4.9 and proves its soundness in theorem 4.1. When I first read theorem 4.1, I didn’t understand why its proof establishes the soundness of definition 4.9. I got it after reading that part several times, and I will share what I learned in this post.

Prerequisites

Some Important Concepts

If you find Koller/Friedman’s PGM difficult to digest, you may not be familiar with some of its concepts, so we first clarify a few concepts that I did not know when I first read this book:

  1. completeness: An axiom system is complete if all valid statements can be derived.
  2. soundness: An axiom system is sound if only valid statements can be derived.
  3. satisfiability: We say that a truth assignment $I$ satisfies a formula $\varphi$ if $\varphi[I]=\mathrm{T} ;$ we write this as $I \models \varphi$. A formula $\varphi$ is satisfiable if there exists a truth assignment $I$ such that $I \models \varphi$; otherwise $\varphi$ is unsatisfiable.
  4. validity: A formula $\varphi$ is valid (or a tautology) if for all truth assignments $I$, $I \models \varphi$.
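
As a quick illustration of the last two definitions, consider propositional variables $p$ and $q$:

$$ \underbrace{p \vee \neg p}_{\text{valid, hence satisfiable}} \qquad \underbrace{p \wedge q}_{\text{satisfiable but not valid}} \qquad \underbrace{p \wedge \neg p}_{\text{unsatisfiable}} $$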

Proof of Exercise 2.5

In Koller/Friedman’s PGM book, the conclusion of exercise 2.5 is used directly to prove other theorems, but no proof is given. I searched for a proof on the Internet and, luckily, found one that looks correct to me.

Exercise 2.5:

Let $X, Y, Z$ be three disjoint subsets of random variables. We say $X$ and $Y$ are conditionally independent given $Z$ if and only if

$$ \mathbb P_{X, Y \mid Z}(x, y \mid z)=\mathbb P_{X \mid Z}(x \mid z) \mathbb P_{Y \mid Z}(y \mid z) $$

Show that $X$ and $Y$ are conditionally independent given $Z$ if and only if the joint distribution for the three subsets of random variables factors in the following form:

$$ \mathbb P_{X, Y, Z}(x, y, z)=h(x, z) g(y, z) $$

Solution:

First, we prove that if $X$ and $Y$ are conditionally independent given $Z$, then we can write $\mathbb P_{X, Y, Z}(x, y, z)=h(x, z) g(y, z)$ as follows:

$$ \begin{aligned} \mathbb P_{X, Y, Z}(x, y, z) &= \mathbb P_{Z}(z) \mathbb P_{X, Y | Z}(x, y | z) \\ &=\mathbb P_{Z}(z) \mathbb P_{X | Z}(x | z) \mathbb P_{Y | Z}(y | z) \\ &=h(x, z) g(y, z) \end{aligned} $$

where we choose $h(x, z)=\mathbb P_{X \mid Z}(x \mid z)$ and $g(y, z) = \mathbb P_{Y \mid Z}(y \mid z) \mathbb P_{Z}(z)$

Then, we prove that if we can write $\mathbb P_{X, Y, Z}(x, y, z)$ in the form $h(x, z) g(y, z)$, then $\mathbb P_{X, Y \mid Z}(x, y \mid z)=\mathbb P_{X \mid Z}(x \mid z) \mathbb P_{Y \mid Z}(y \mid z)$:

$$ \begin{aligned} \mathbb P_{X, Y \mid Z}(x, y \mid z) &=\frac{\mathbb P_{X, Y, Z}(x, y, z)}{\sum_{x, y} \mathbb P_{X, Y, Z}(x, y, z)} \\ &=\frac{h(x, z) g(y, z)}{\sum_{x, y} h(x, z) g(y, z)} \\ &=\frac{h(x, z)}{h_{1}(z)} \frac{g(y, z)}{g_{1}(z)} \\ &=\mathbb P_{X \mid Z}(x \mid z) \mathbb P_{Y \mid Z}(y \mid z) \end{aligned} $$
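
To justify the last two steps, define the normalizers $h_{1}(z)=\sum_{x} h(x, z)$ and $g_{1}(z)=\sum_{y} g(y, z)$; marginalizing the factored joint then gives

$$ \mathbb P_{X \mid Z}(x \mid z)=\frac{\sum_{y} \mathbb P_{X, Y, Z}(x, y, z)}{\sum_{x, y} \mathbb P_{X, Y, Z}(x, y, z)}=\frac{h(x, z) g_{1}(z)}{h_{1}(z) g_{1}(z)}=\frac{h(x, z)}{h_{1}(z)}, \qquad \mathbb P_{Y \mid Z}(y \mid z)=\frac{g(y, z)}{g_{1}(z)} $$

which is exactly what the third and fourth equalities above use.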

Definition 4.4

We say that a distribution $P_{\Phi}$ with $\Phi=\left\{\phi_{1}\left(\boldsymbol{D}_{1}\right), \ldots, \phi_{K}\left(\boldsymbol{D}_{K}\right)\right\}$ factorizes over a Markov network $\mathcal{H}$ if each $\boldsymbol{D}_{k}$ $(k=1, \ldots, K)$ is a complete subgraph of $\mathcal{H}$.
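
For example, in the chain $A - B - C$ used later in this post, factors with scopes $\{A, B\}$ and $\{B, C\}$ factorize over the network, while a factor with scope $\{A, C\}$ does not, because $A$ and $C$ are not connected by an edge and so $\{A, C\}$ is not a complete subgraph.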

Gibbs Distribution

A distribution $P_{\Phi}$ is a Gibbs distribution parameterized by a set of factors $\Phi=\left\{\phi_{1}\left(\boldsymbol{D}_{1}\right), \ldots, \phi_{K}\left(\boldsymbol{D}_{K}\right)\right\}$ if it is defined as follows:

$$ P_{\Phi}\left(X_{1}, \ldots, X_{n}\right)=\frac{1}{Z} \tilde{P}_{\Phi}\left(X_{1}, \ldots, X_{n}\right) $$

where

$$ \tilde{P}_{\Phi}\left(X_{1}, \ldots, X_{n}\right)=\phi_{1}\left(\boldsymbol{D}_{1}\right) \times \phi_{2}\left(\boldsymbol{D}_{2}\right) \times \cdots \times \phi_{K}\left(\boldsymbol{D}_{K}\right) $$

is an unnormalized measure and

$$ Z=\sum_{X_{1}, \ldots, X_{n}} \tilde{P}_{\Phi}\left(X_{1}, \ldots, X_{n}\right) $$

is the normalizing constant, known as the partition function.
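
As a minimal numerical sketch of this definition (the variables, factor names, and values below are made up purely for illustration), consider three binary variables $X_1, X_2, X_3$ with factors $\phi_1(X_1, X_2)$ and $\phi_2(X_2, X_3)$; multiplying the factors and normalizing by $Z$ yields a proper distribution:

```python
import itertools

import numpy as np

# Hypothetical factor values, chosen only for illustration; any non-negative
# numbers would do.  phi1 has scope {X1, X2}, phi2 has scope {X2, X3}.
phi1 = np.array([[30.0, 5.0],
                 [1.0, 10.0]])   # phi1[x1, x2]
phi2 = np.array([[100.0, 1.0],
                 [1.0, 100.0]])  # phi2[x2, x3]

# Unnormalized measure: P~(x1, x2, x3) = phi1(x1, x2) * phi2(x2, x3)
p_tilde = {
    (x1, x2, x3): phi1[x1, x2] * phi2[x2, x3]
    for x1, x2, x3 in itertools.product([0, 1], repeat=3)
}

# Partition function: Z sums the unnormalized measure over all assignments.
Z = sum(p_tilde.values())

# Gibbs distribution: P(x1, x2, x3) = P~(x1, x2, x3) / Z
P = {assignment: value / Z for assignment, value in p_tilde.items()}

print("Z =", Z)
print("sum of P over all assignments =", sum(P.values()))  # 1.0
```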

If a Gibbs distribution factorizes over a Markov network, then each of its factors is defined over a clique, i.e., a complete subgraph.

Definition 4.8

Let $\mathcal{H}$ be a Markov network structure, and let $X_{1}-\ldots-X_{k}$ be a path in $\mathcal{H}$. Let $\boldsymbol{Z} \subseteq \mathcal{X}$ be a set of observed variables. The path $X_{1}-\ldots-X_{k}$ is active given $\boldsymbol{Z}$ if none of the $X_{i}$ ’s, $i=1, \ldots, k$, is in $\boldsymbol{Z}$.
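
For example, in the chain $A - B - C$ of the next section, the path $A - B - C$ is active given $\boldsymbol{Z}=\emptyset$, but it is not active given $\boldsymbol{Z}=\{B\}$, because $B$ is observed.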

Why are potential functions defined on cliques?

$A - B - C$

From the graph above, we can read off $A \bot C \mid B$, so we do not want $A$ and $C$ in the same potential function, because they are independent once $B$ is given. We also notice that there is no edge connecting $A$ and $C$ directly, which suggests the intuition that nodes that are not connected to each other should not appear in the same potential function. This is the intuition behind defining the potentials of a UDG on cliques.

We use potential functions to measure the affinity among the nodes within the same clique, not conditional or marginal probabilities. For instance, by the chain rule we can write a distribution $P$ that factorizes over the graph above as $P = P(A)P(B \mid A)P(C \mid B)$. If you group $P$ into two factors, you may pick one of the two forms $P = P(A,B)P(C \mid B)$ or $P = P(B, C)P(A \mid B)$. But remember that a distribution defined on this UDG factorizes as $P = \Phi(A, B)\Phi(B,C)$, so not all potential functions can be marginal or conditional probabilities. In one word, potential functions are not probabilities. How, then, can we make sure that the distribution built from potential functions sums to 1 when we sum over all of its variables? We need a normalizer $Z$. From definition 4.4 and the definition of the Gibbs distribution, we know that we can use a Gibbs distribution that factorizes over the Markov network to represent the distribution expressed by the Markov network.
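
The two claims above, that potentials are not probabilities and that the normalized product $\frac{1}{Z}\Phi(A, B)\Phi(B, C)$ still respects $A \bot C \mid B$, are easy to check numerically. Here is a small sketch with made-up potential values for binary $A, B, C$:

```python
import itertools

import numpy as np

# Made-up potentials for the chain A - B - C (binary variables); note that
# they are not probabilities: entries exceed 1 and nothing sums to 1.
phi_AB = np.array([[4.0, 1.0],
                   [1.0, 4.0]])  # phi_AB[a, b]
phi_BC = np.array([[9.0, 1.0],
                   [2.0, 8.0]])  # phi_BC[b, c]

# Joint distribution P(a, b, c) = phi_AB(a, b) * phi_BC(b, c) / Z
joint = np.zeros((2, 2, 2))
for a, b, c in itertools.product([0, 1], repeat=3):
    joint[a, b, c] = phi_AB[a, b] * phi_BC[b, c]
joint /= joint.sum()   # dividing by the normalizer Z makes it sum to 1

# Marginals needed for the independence check.
P_B = joint.sum(axis=(0, 2))   # P(b)
P_AB = joint.sum(axis=2)       # P(a, b)
P_BC = joint.sum(axis=0)       # P(b, c)

# Verify A ⊥ C | B: P(a, c | b) == P(a | b) * P(c | b) for every assignment.
for a, b, c in itertools.product([0, 1], repeat=3):
    lhs = joint[a, b, c] / P_B[b]
    rhs = (P_AB[a, b] / P_B[b]) * (P_BC[b, c] / P_B[b])
    assert np.isclose(lhs, rhs)

# The potential phi_AB is not the marginal P(A, B): compare the raw values
# with the normalized marginal.
print(phi_AB)
print(P_AB)
```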

Koller/Friedman say in the PGM book that “if we define the UDG on cliques, we will not violate the independence assumptions induced by the network structure”. In my view, the authors want to say that if we use a Gibbs distribution that factorizes over the Markov network to represent the distribution expressed by the network, then we can prove theorem 4.1, which also establishes the soundness of d-separation, i.e., the independence assumptions induced by the network.

Why does d-separation work?

Definition 4.9 defines d-separation (separation) for a UDG. The proof of its soundness answers this question.

Definition 4.9:

We say that a set of nodes $\boldsymbol{Z}$ separates $\boldsymbol{X}$ and $\boldsymbol{Y}$ in $\mathcal{H}$, denoted $\operatorname{sep}_{\mathcal{H}}(\boldsymbol{X} ; \boldsymbol{Y} \mid \boldsymbol{Z})$, if there is no active path between any node $X \in \boldsymbol{X}$ and $Y \in \boldsymbol{Y}$ given $\boldsymbol{Z}$. We define the global independencies associated with $\mathcal{H}$ to be:

$$ \mathcal{I}(\mathcal{H})=\left\{(\boldsymbol{X} \perp \boldsymbol{Y} \mid \boldsymbol{Z}): \operatorname{sep}_{\mathcal{H}}(\boldsymbol{X} ; \boldsymbol{Y} \mid \boldsymbol{Z})\right\} $$
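
Checking $\operatorname{sep}_{\mathcal{H}}(\boldsymbol{X} ; \boldsymbol{Y} \mid \boldsymbol{Z})$ is plain graph reachability: remove the observed nodes $\boldsymbol{Z}$ and ask whether $\boldsymbol{X}$ can still reach $\boldsymbol{Y}$. Below is a minimal sketch (the `separated` helper and the toy graph are my own illustration, not code from the book):

```python
from collections import deque

def separated(edges, X, Y, Z):
    """Return True if Z separates X from Y in the undirected graph `edges`,
    i.e., if every path from a node in X to a node in Y passes through Z."""
    Z = set(Z)
    # Build adjacency lists with the observed nodes in Z removed.
    adj = {}
    for u, v in edges:
        if u in Z or v in Z:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Breadth-first search from X; reaching any node of Y means there is an
    # active path, so X and Y are not separated given Z.
    frontier = deque(x for x in X if x not in Z)
    visited = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in Y:
            return False
        for neighbor in adj.get(node, ()):
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return True

# The chain A - B - C from the previous section:
edges = [("A", "B"), ("B", "C")]
print(separated(edges, {"A"}, {"C"}, {"B"}))  # True:  (A ⊥ C | B) is in I(H)
print(separated(edges, {"A"}, {"C"}, set()))  # False: the path A - B - C is active
```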

Here we prove the soundness of definition 4.9; its completeness is left for another post. For definition 4.9, soundness and completeness are defined as follows:

  • Soundness: if we find that two nodes $X$ and $Y$ are d-separated given some $Z$, then we are guaranteed that they are, in fact, conditionally independent given $Z$.
  • Completeness: d-separation detects all possible independencies. More precisely, if two variables $X$ and $Y$ are independent given $Z$, then they are d-separated.

Before we prove the soundness, recall that the distribution expressed by the Markov network $\mathcal{H}$ can be represented by a Gibbs distribution $P$ over $\mathcal{X}$ that factorizes over $\mathcal{H}$. Let $\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}$ be any three disjoint subsets of $\mathcal{X}$ such that $\boldsymbol{Z}$ separates $\boldsymbol{X}$ and $\boldsymbol{Y}$ in $\mathcal{H}$. We want to show that $P \models(\boldsymbol{X} \perp \boldsymbol{Y} \mid \boldsymbol{Z})$.

We start by considering the case where $\boldsymbol{X} \cup \boldsymbol{Y} \cup \boldsymbol{Z}=\mathcal{X} .$ As $\boldsymbol{Z}$ separates $\boldsymbol{X}$ from $\boldsymbol{Y}$, there are no direct edges between $\boldsymbol{X}$ and $\boldsymbol{Y}$. Hence, any clique in $\mathcal{H}$ is fully contained either in $\boldsymbol{X} \cup \boldsymbol{Z}$ or in $\boldsymbol{Y} \cup \boldsymbol{Z}$. Let $\mathcal{I}_{\boldsymbol{X}}$ be the indexes of the set of cliques that are contained in $\boldsymbol{X} \cup \boldsymbol{Z}$, and let $\mathcal{I}_{Y}$ be the indexes of the remaining cliques. We know that

$$ P\left(X_{1}, \ldots, X_{n}\right)=\frac{1}{Z} \prod_{i \in \mathcal{I}_{\boldsymbol{X}}} \phi_{i}\left(\boldsymbol{D}_{i}\right) \cdot \prod_{i \in \mathcal{I}_{\boldsymbol{Y}}} \phi_{i}\left(\boldsymbol{D}_{i}\right) $$

As we discussed, none of the factors in the first product involve any variable in $\boldsymbol{Y}$, and none in the second product involve any variable in $\boldsymbol{X}$. Hence, we can rewrite this product in the form:

$$ P\left(X_{1}, \ldots, X_{n}\right)=\frac{1}{Z} f(\boldsymbol{X}, \boldsymbol{Z}) g(\boldsymbol{Y}, \boldsymbol{Z}) $$

From this decomposition, the desired independence follows immediately (exercise 2.5). Now consider the case where $\boldsymbol{X} \cup \boldsymbol{Y} \cup \boldsymbol{Z} \subset \mathcal{X}$. Let $\boldsymbol{U}=\mathcal{X}-(\boldsymbol{X} \cup \boldsymbol{Y} \cup \boldsymbol{Z})$. We can partition $\boldsymbol{U}$ into two disjoint sets $\boldsymbol{U}_{1}$ and $\boldsymbol{U}_{2}$ such that $\boldsymbol{Z}$ separates $\boldsymbol{X} \cup \boldsymbol{U}_{1}$ from $\boldsymbol{Y} \cup \boldsymbol{U}_{2}$ in $\mathcal{H}$; for instance, we can take $\boldsymbol{U}_{1}$ to be the variables of $\boldsymbol{U}$ that are reachable from $\boldsymbol{X}$ by a path that does not pass through $\boldsymbol{Z}$, and $\boldsymbol{U}_{2}=\boldsymbol{U}-\boldsymbol{U}_{1}$. Using the preceding argument, we conclude that $P \models\left(\boldsymbol{X}, \boldsymbol{U}_{1} \perp \boldsymbol{Y}, \boldsymbol{U}_{2} \mid \boldsymbol{Z}\right)$. Using the decomposition property $(\boldsymbol{X} \perp \boldsymbol{Y}, \boldsymbol{W} \mid \boldsymbol{Z}) \Longrightarrow(\boldsymbol{X} \perp \boldsymbol{Y} \mid \boldsymbol{Z})$ (equation 2.8), we conclude that $P \models(\boldsymbol{X} \perp \boldsymbol{Y} \mid \boldsymbol{Z})$.

Let’s look at theorem 4.1.

Theorem 4.1:

Let $P$ be a distribution over $\mathcal{X}$, and $\mathcal{H}$ a Markov network structure over $\mathcal{X}$. If $P$ is a Gibbs distribution that factorizes over $\mathcal{H}$, then $\mathcal{H}$ is an I-map for $P$.

The proof of theorem 4.1 is the same as the soundness proof above, which is why the authors say: “We first consider the analog to theorem 3.2, which asserts that a Gibbs distribution satisfies the independencies associated with the graph. In other words, this result states the soundness of the separation criterion.”

I will talk about completeness in another post.


  1. Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.