# Stat946f11pool

## Contents

### Undirected Graphical Models

In the previous sections we discussed the Bayes Ball algorithm and the way we can use it to determine if there exists a conditional independence between two nodes in the graph. This algorithm can be easily modified to allow us to determine the same information in an undirected graph. An undirected graph that provides information about the relationships between different random variables can also be called a "Markov Random Field".

As before we must define a set of canonical graphs. The nice thing is that for undirected graphs there is really only one type of canonical graph:

Fig.20 The only way to connect 3 nodes in an undirected graph.

In the first figure (Fig. 21) we have no information about the node Y, so we cannot say that the nodes X and Z are independent: the ball can pass from one to the other. In Fig. 22, on the other hand, the value of Y is known, so the ball cannot pass from X to Z or from Z to X. In this case we can say that X and Z are independent given Y.

$X \amalg Z | Y$
Fig.21 The ball can pass through the middle node.
Fig.22 The ball can not pass through the middle node.

Now that we have a type of Bayes Ball algorithm for both directed and undirected graphs we can ask ourselves the question: Is there an algorithm or method that we can use to convert between directed and undirected graphs?

In general: NO.
In fact, not only is there no general method for conversion, but some graphs have no equivalent and can exist only in the undirected or only in the directed form. Take the following undirected graph (Fig. 23). We can see that the random variables represented in this graph have the following properties:

$X \amalg Y | \lbrace W, Z \rbrace$
$W \amalg Z | \lbrace X, Y \rbrace$
Fig.23 There is no directed equivalent to this graph.

Now try building a directed graph with the same properties taking into consideration that directed graphs cannot contain a cycle. Under this restriction it is in fact impossible to find an equivalent directed graph that satisfies all of the above properties. Similarly, consider the following directed graph (Fig. 24). It can not be represented by any undirected graph with 3 nodes.

Fig.24 There is no undirected equivalent to this graph.

When we want to graph the relationships between a set of random variables it is important to consider both graph types, since some relationships can only be graphed on a certain type of graph. We must therefore conclude that undirected graphs are just as important as the directed ones. For the directed graphs we have an expression for $P(x_V)$. We should try to develop a similar statement for the undirected graphs.
In order to develop the expression we need to introduce more terminology.

• Clique -

A subset of fully connected nodes in a graph G. Every node in the clique C is directly connected to every other node in C.

• Maximal Clique -

A clique where if any other node from the graph G is added to it then the new set is no longer a clique.

Let $C$ be the set of all maximal cliques of $G$.
Let $\psi_{c_i}$ be a non-negative real-valued function.
Now associate one $\psi_{c_i}$ with each maximal clique $c_i \in C$; then

$P(x_{V}) = \frac{1}{Z(\Psi)} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})$

where

$Z(\Psi) = \sum_{x_V} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})$
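As a minimal sketch of this definition, the following Python snippet builds the joint distribution for the three-node chain X - Y - Z, whose maximal cliques are {X, Y} and {Y, Z}. The potential values are hypothetical and chosen only for illustration; any non-negative numbers would do.

```python
from itertools import product

# Hypothetical potentials on the maximal cliques {X, Y} and {Y, Z}
# of the chain X - Y - Z (all variables binary).
psi_xy = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
psi_yz = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}


def unnormalized(x, y, z):
    """Product of the clique potentials for one configuration."""
    return psi_xy[(x, y)] * psi_yz[(y, z)]


# Partition function: sum of the product over every configuration.
Z = sum(unnormalized(x, y, z) for x, y, z in product((0, 1), repeat=3))


def joint(x, y, z):
    """P(x, y, z) = (1 / Z) * product of clique potentials."""
    return unnormalized(x, y, z) / Z


def cond_given_y(x, z, y):
    """P(x, z | y), obtained from the joint by normalizing over x and z."""
    p_y = sum(joint(a, y, c) for a in (0, 1) for c in (0, 1))
    return joint(x, y, z) / p_y
```

Because the potentials are not probabilities, the normalization by Z is what turns the product into a distribution. The factorization also implies that X and Z are independent given Y, which can be checked numerically with `cond_given_y`.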

#### Conditional independence

For directed graphs, the Bayes ball method was defined to determine the conditional independence properties of a given graph. We can also employ the Bayes ball algorithm to examine conditional independence in undirected graphs, where the rule is simpler and more intuitive. Considering Figure.... , a ball can be thrown either from x to z or from z to x if y is not observed. In other words, if y is not observed, a ball thrown from x can reach z and vice versa. In contrast, a shaded (observed) y blocks the ball and makes x and z conditionally independent. With this definition one can state that, in an undirected graph, a node is conditionally independent of its non-neighbors given its neighbors. More generally, $X_A$ is independent of $X_C$ given $X_B$ if the set of nodes $X_B$ separates the nodes $X_A$ from the nodes $X_C$: if every path from a node in $X_A$ to a node in $X_C$ includes at least one node in $X_B$, then $X_A \perp X_C | X_B$.
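The separation criterion can be sketched as a breadth-first search that is not allowed to enter the conditioning set. The function name `is_separated` and the adjacency-dictionary representation are illustrative choices. The example graph is an assumption: a four-node cycle X - W - Y - Z - X, which is consistent with the two independence statements given for Fig. 23.

```python
from collections import deque


def is_separated(adj, A, C, B):
    """Return True if every path from a node in A to a node in C passes through B."""
    blocked = set(B)
    frontier = deque(a for a in A if a not in blocked)
    seen = set(frontier)
    while frontier:
        node = frontier.popleft()
        if node in C:
            return False  # reached C without crossing B
        for nb in adj[node]:
            if nb not in seen and nb not in blocked:
                seen.add(nb)
                frontier.append(nb)
    return True


# Assumed structure for illustration: the four-cycle X - W - Y - Z - X.
cycle = {
    "X": {"W", "Z"},
    "W": {"X", "Y"},
    "Y": {"W", "Z"},
    "Z": {"X", "Y"},
}

x_indep_y = is_separated(cycle, {"X"}, {"Y"}, {"W", "Z"})   # both paths blocked
x_indep_y_given_w = is_separated(cycle, {"X"}, {"Y"}, {"W"})  # path through Z is open
```

Observing only W leaves the path X - Z - Y open, so separation fails in that case.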

## Graphical Algorithms

In the previous chapter there were two kinds of graphical models that were used to represent dependencies between variables. One is a directed graphical model while the other is an undirected graphical model. In the case of directed graphs we can define the joint probability distribution based on a product of conditional probabilities where each node is conditioned on the value(s) of its parent(s). In the case of the undirected graphs we can define the joint probability distribution based on the normalized product of ψ functions based on the nodes that form maximal cliques in the graph. A maximal clique is a clique where we can not add an additional node such that the clique remains fully connected.
In the previous chapter we also developed the following two expressions for $P(x_V)$:

#### For Directed Graphs:

$P(x_V) = \prod_{i=1}^{n} P(x_i | x_{\pi_i})$

#### For Undirected Graphs:

$P(x_{V}) = \frac{1}{Z(\Psi)} \prod_{c_i \in C} \psi_{c_i} (x_{c_i})$

#### Theorem: Hammersley - Clifford

Let $U_1$ be the set of all distributions $P(x_V)$ that factorize as a normalized product of potentials over the maximal cliques of a given undirected graph, and let $U_2$ be the set of all distributions that satisfy every conditional independence statement implied by that graph. The theorem states that, for strictly positive distributions, the sets $U_1$ and $U_2$ are in fact the same set.

$U_{1} = \left \{ P(x_{V}) = \frac{1}{Z(\psi)} \prod_{c_i \in C} \psi_{c_i} (x_{c_i}) \right \}$
$U_{2} = \left \{ P(x_{V}) \mid P(x_{V}) \mbox{ satisfies all conditional independencies of the graph} \right \}$
Then: $U_1 = U_2$

There is a lot of information contained in the joint probability distribution $P(x_V)$. We have defined 6 tasks (listed below) that we would like to accomplish with various algorithms for a given distribution $P(x_V)$. Each algorithm may be able to perform only a subset of the tasks listed below.

• Marginalization

Given $P(x_V)$ find $P(x_A)$.
Example: given $P(x_1,x_2,\ldots,x_6)$ find $P(x_2,x_6)$.

• Conditioning

Given $P(x_V)$ find $P(x_A|x_B) = \frac{P(x_A, x_B)}{P(x_B)}$.

• Evaluation

Evaluate the probability for a certain configuration.

• Completion

Compute the most probable configuration. In other words, for given sets A and B and an observed $x_B$, find the value of $x_A$ for which $P(x_A | x_B)$ is largest.

• Simulation

Generate a random configuration drawn from $P(x_V)$.

• Learning

We would like to find parameters for $P(x_V)$.

### Exact Algorithms:

We will be looking at three exact algorithms. An exact algorithm finds the exact answer to one of the above tasks. The main disadvantage of exact algorithms is that, for graphs with a large number of nodes, they can take a long time to produce a result. When this occurs we can use inexact algorithms to find a useful estimate more efficiently.

• Elimination
• Sum-Product
• Junction Tree

### General Inference:

Let us first define a set of nodes called Evidence Nodes. We will denote evidence nodes with $x_E$. These nodes represent the random variables about which we have information. Similarly, let us define the set of nodes $x_F$ as Query Nodes. These are the nodes for which we seek information. By the definition of conditional probability we know that:

$P(x_F|x_E) = \frac{P(x_F,x_E)}{P(x_E)}$

Let G(V,ε) be a graph with vertices V and edges ε

The set of nodes V is made up of the evidence nodes E, the query nodes F, and the nodes that are neither query nor evidence nodes, R. We call R the remainder nodes. These three sets are disjoint, therefore
$V = E \cup F \cup R$ and $R = V \setminus (E \cup F)$

$P(x_F,x_E) = \sum_{x_R} P(x_V) = \sum_{x_R} P(x_E,x_F,x_R)$

Example:
Consider once again the example from Figure \ref{fig:ClassicExample1}. Suppose we want to calculate $P(x_1|\bar{x}_6)$, where $\bar{x}_6$ refers to a fixed value of $x_6$.

If we represent the joint probabilities normally we have $P(x_1, x_2, ..., x_5) = \sum_{x_6}P(x_1, x_2, ..., x_6)$, which requires a table of probabilities of size $2^6$. In general this table is of size $k^n$, where $k$ is the number of values each variable can take on and $n$ is the number of vertices. For a computer algorithm this is exponential: $O(k^n)$.

We can reduce the complexity if we represent the probabilities in factored form.

$\begin{matrix} P(x_1, x_2, ..., x_5) &= \sum_{x_6} P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_3)P(x_6|x_2, x_5) \\
&= P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_3) \sum_{x_6} P(x_6|x_2, x_5) \end{matrix}$

Here the complexity is only $O(nk^{r+1})$, where $r$ is the maximum number of parents of a node, because each conditional table involves a node and its parents. In our case the largest table has been reduced to size $2^3$ from $2^6$.

Let $m_i(x_{s_i})$ be the expression that arises when we perform $\sum_{x_i} P(x_i|x_{s_i})$, where $x_{s_i}$ represents a set of variables other than $x_i$.
For instance, in our example we can say that $m_6(x_2, x_5) = \sum_{x_6} P(x_6|x_2, x_5)$.

We know that according to Bayes Theorem we can calculate $P(x_1, \bar{x}_6)$ and $P(\bar{x}_6)$ separately in order to find the desired conditional probability.

$P(x_1|\bar{x}_6) = \frac{P(x_1, \bar{x}_6)}{P(\bar{x}_6)}$

Let us begin by calculating $P(x_1, \bar{x}_6)$ .

$\begin{matrix} P(x_1, \bar{x}_6) &= \sum_{x_2}\sum_{x_3}\sum_{x_4}\sum_{x_5}P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_3)P(\bar{x}_6|x_2, x_5) \\
&= P(x_1)\sum_{x_2}P(x_2|x_1)\sum_{x_3}P(x_3|x_1)\sum_{x_4}P(x_4|x_2)\sum_{x_5}P(x_5|x_3)P(\bar{x}_6|x_2, x_5) \\
&= P(x_1)\sum_{x_2}P(x_2|x_1)\sum_{x_3}P(x_3|x_1)\sum_{x_4}P(x_4|x_2)m_5(x_2, x_3, \bar{x}_6) \\
&= P(x_1)\sum_{x_2}P(x_2|x_1)\sum_{x_3}P(x_3|x_1)m_5(x_2, x_3, \bar{x}_6)\sum_{x_4}P(x_4|x_2) \\
&= P(x_1)\sum_{x_2}P(x_2|x_1)\sum_{x_3}P(x_3|x_1)m_5(x_2, x_3, \bar{x}_6)m_4(x_2) \\
&= P(x_1)\sum_{x_2}P(x_2|x_1)m_4(x_2)m_3(x_1, x_2, \bar{x}_6) \\
&= P(x_1)m_2(x_1,\bar{x}_6) \end{matrix}$

We can then use the above result to calculate the next desired probability: $P(\bar{x}_6) = \sum_{x_1}P(x_1, \bar{x}_6)$.

Finally, by using the above two results we can calculate $P(x_1|\bar{x}_6) = \frac{P(x_1, \bar{x}_6)}{P(\bar{x}_6)}$.
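The nested sums above can be checked numerically. The sketch below assumes binary variables and fills the conditional probability tables with arbitrary random values (seeded for reproducibility); these tables are hypothetical and serve only to compare the pushed-in sums against a brute-force sum over the full joint.

```python
import random
from itertools import product

rng = random.Random(0)
K = (0, 1)  # every variable is assumed binary


def random_cpt(n_parents):
    """Map each parent configuration to P(child = 1 | parents); illustrative values."""
    return {pa: rng.uniform(0.1, 0.9) for pa in product(K, repeat=n_parents)}


def prob(cpt, child, *parents):
    """Look up P(child | parents) in a table that stores P(child = 1 | parents)."""
    p1 = cpt[parents]
    return p1 if child == 1 else 1.0 - p1


c1, c2, c3, c4, c5, c6 = (random_cpt(n) for n in (0, 1, 1, 1, 1, 2))


def joint(x1, x2, x3, x4, x5, x6):
    return (prob(c1, x1) * prob(c2, x2, x1) * prob(c3, x3, x1)
            * prob(c4, x4, x2) * prob(c5, x5, x3) * prob(c6, x6, x2, x5))


def brute_force(x1, x6_bar):
    """P(x1, x6_bar) by summing the full joint over x2..x5."""
    return sum(joint(x1, x2, x3, x4, x5, x6_bar)
               for x2, x3, x4, x5 in product(K, repeat=4))


def pushed_in_sums(x1, x6_bar):
    """P(x1, x6_bar) using the factored form with sums pushed inward."""
    total = 0.0
    for x2 in K:
        m4 = sum(prob(c4, x4, x2) for x4 in K)  # m_4(x2)
        m3 = 0.0
        for x3 in K:
            # m_5(x2, x3): sum over x5 with x6 fixed at its observed value
            m5 = sum(prob(c5, x5, x3) * prob(c6, x6_bar, x2, x5) for x5 in K)
            m3 += prob(c3, x3, x1) * m5
        total += prob(c2, x2, x1) * m4 * m3
    return prob(c1, x1) * total


x6_bar = 1
p_joint = {x1: pushed_in_sums(x1, x6_bar) for x1 in K}
p_evidence = sum(p_joint.values())                          # P(x6_bar)
posterior = {x1: p_joint[x1] / p_evidence for x1 in K}      # P(x1 | x6_bar)
```

Both routes give the same numbers, but the pushed-in version never builds a table over more than three variables at once.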

### Evaluation

Define $X_i$ as an evidence node whose observed value is $\overline{x_i}$. To show that $X_i$ is fixed at the value $\overline{x_i}$, we define an evidence potential $\delta{(x_i,\overline{x_i})}$ whose value is 1 if $x_i = \overline{x_i}$ and 0 otherwise.
So, for any function $g$,

$g(\overline{x_i}) =\sum_{x_i}{g(x_i)\delta{(x_i,\overline{x_i})}}$

When we have more than one evidence variable, as in $p(x_F|\overline{x_E})$, the total evidence potential is:

$\delta{(x_E,\overline{x_E})}= \prod_{i\in E}\delta{(x_i,\overline{x_i})}$

### Elimination and Directed Graphs

Given a graph $G = (V, \varepsilon)$, a set of evidence nodes $E$, and a query node $F$, we first choose an elimination ordering $I$ such that $F$ appears last in this ordering.

Example:
For the graph in (Fig. \ref{fig:ClassicExample1}): $G = (V, \varepsilon)$. Consider once again that node $x_1$ is the query node and $x_6$ is the evidence node.
$I = \left\{6,5,4,3,2,1\right\}$ (1 should be the last node, ordering is crucial)
We must now create an active list. There are two rules that must be followed in order to create this list.

1. For $i\in{V}$ put $p(x_i|x_{\pi_i})$ in the active list.
2. For $i\in{E}$ put $\delta{(x_i,\overline{x_i})}$ in the active list.

Here, our active list is: $p(x_1), p(x_2|x_1), p(x_3|x_1), p(x_4|x_2), p(x_5|x_3),\underbrace{p(x_6|x_2, x_5)\delta{(x_6,\overline{x_6})}}_{\phi_6(x_2,x_5, x_6)}$, where $\sum_{x_6}{\phi_6(x_2,x_5,x_6)}=m_{6}(x_2,x_5)$.

We first eliminate node $X_6$: we remove $\phi_6$ from the active list and place $m_6(x_2,x_5)$ on it. We now eliminate $X_5$:

$m_5(x_2,x_3) = \sum_{x_5} p(x_5|x_3)\, m_6(x_2,x_5)$

Likewise, we can eliminate $X_4$, $X_3$ and $X_2$, which leaves $\phi_1(x_1) = p(x_1)m_2(x_1)$, the unnormalized conditional probability $p(x_1|\overline{x_6})$. Finally, eliminating $X_1$ yields $m_1 = \sum_{x_1}{\phi_1(x_1)}$, which is the normalization factor $p(\overline{x_6})$.
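The active-list procedure can be sketched generically. In the snippet below a factor is a pair (tuple of variable indices, table), and the helper names `make_factor`, `multiply_and_sum_out` and `eliminate` are illustrative, not part of any library. All variables are assumed binary and the conditional probabilities are hypothetical numbers.

```python
from itertools import product

K = (0, 1)  # every variable is assumed binary in this sketch


def make_factor(variables, fn):
    """Tabulate fn over all assignments of `variables`."""
    return (tuple(variables),
            {vals: fn(*vals) for vals in product(K, repeat=len(variables))})


def multiply_and_sum_out(factors, var):
    """Multiply the factors that mention `var`, then sum `var` out."""
    scope = sorted({v for vs, _ in factors for v in vs if v != var})
    table = {}
    for vals in product(K, repeat=len(scope)):
        assign = dict(zip(scope, vals))
        total = 0.0
        for x in K:
            assign[var] = x
            term = 1.0
            for vs, t in factors:
                term *= t[tuple(assign[v] for v in vs)]
            total += term
        table[vals] = total
    return (tuple(scope), table)


def eliminate(active, order):
    """Sum out the variables in `order`, updating the active list each time."""
    active = list(active)
    for var in order:
        touching = [f for f in active if var in f[0]]
        active = [f for f in active if var not in f[0]]
        active.append(multiply_and_sum_out(touching, var))
    return active


def bern(p1, x):
    return p1 if x == 1 else 1.0 - p1


# Hypothetical conditional probabilities for the six-node example.
active = [
    make_factor([1], lambda x1: bern(0.6, x1)),
    make_factor([2, 1], lambda x2, x1: bern(0.2 + 0.5 * x1, x2)),
    make_factor([3, 1], lambda x3, x1: bern(0.7 - 0.4 * x1, x3)),
    make_factor([4, 2], lambda x4, x2: bern(0.3 + 0.3 * x2, x4)),
    make_factor([5, 3], lambda x5, x3: bern(0.1 + 0.6 * x3, x5)),
    make_factor([6, 2, 5], lambda x6, x2, x5: bern(0.1 + 0.4 * x2 + 0.4 * x5, x6)),
    make_factor([6], lambda x6: 1.0 if x6 == 1 else 0.0),  # evidence potential, x6 = 1
]

remaining = eliminate(active, order=[6, 5, 4, 3, 2])

# Every remaining factor is a function of x1 only; their product is p(x1, x6 = 1).
unnorm = {x1: 1.0 for x1 in K}
for _, table in remaining:
    for x1 in K:
        unnorm[x1] *= table[(x1,)]

p_evidence = sum(unnorm.values())                        # eliminating x1 gives p(x6 = 1)
posterior = {x1: unnorm[x1] / p_evidence for x1 in K}    # p(x1 | x6 = 1)
```

Keeping the query node out of `order` leaves the unnormalized conditional on the active list, and the final sum over the query node is the normalization factor.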

#### Elimination and Undirected Graphs

We would also like to do this elimination on undirected graphs such as G'.

Fig.XX Undirected graph G'

The first task is to find the maximal cliques and their associated potential functions.
maximal cliques: $\left\{x_1, x_2\right\}$, $\left\{x_1, x_3\right\}$, $\left\{x_2, x_4\right\}$, $\left\{x_3, x_5\right\}$, $\left\{x_2,x_5,x_6\right\}$
potential functions: $\varphi{(x_1,x_2)},\varphi{(x_1,x_3)},\varphi{(x_2,x_4)}, \varphi{(x_3,x_5)}$ and $\varphi{(x_2,x_5,x_6)}$

$p(x_1|\overline{x_6})=p(x_1,\overline{x_6})/p(\overline{x_6})\cdots\cdots\cdots\cdots\cdots(*)$

$p(x_1,\overline{x_6})=\frac{1}{Z}\sum_{x_2,x_3,x_4,x_5,x_6}\varphi{(x_1,x_2)}\varphi{(x_1,x_3)}\varphi{(x_2,x_4)}\varphi{(x_3,x_5)}\varphi{(x_2,x_5,x_6)}\delta{(x_6,\overline{x_6})}$

The $\frac{1}{Z}$ looks crucial, but in fact it has no effect, because in (*) both the numerator and the denominator contain the $\frac{1}{Z}$ term, so it cancels.
The general rule for elimination in an undirected graph is that we can remove a node as long as we connect all of the neighbors of that node together. Effectively, we form a clique out of the neighbors of that node.

Example:
For the graph G in (Fig. \ref{fig:Ex1Lab})
when we remove x1, G becomes (Fig. \ref{fig:Ex2Lab})
if we remove x2, G becomes (Fig. \ref{fig:Ex3Lab})

Fig.XX
Fig.XX
Fig.XX

An interesting thing to point out is that the order of the elimination matters a great deal. Consider the two results. If we remove one node the graph complexity is slightly reduced (Fig. \ref{fig:Ex2Lab}). But if we try to remove another node the complexity is significantly increased (Fig. \ref{fig:Ex3Lab}). We care about the complexity of the graph because it determines the number of calculations required to answer questions about that graph. If we had a huge graph with thousands of nodes, the order of node removal would be key to the complexity of the algorithm. Unfortunately, finding the optimal node removal order is itself a hard problem (NP-hard in general), so there is no known efficient algorithm that produces it.
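The effect of the ordering can be illustrated with a small sketch that performs only the graph operation: remove a node and connect its neighbors. The function name `eliminate_graph` and the star-shaped example graph are assumptions made for illustration, not the graph from the figures.

```python
def eliminate_graph(adj, order):
    """Remove nodes in `order`, connecting the neighbors of each removed node.

    Returns the size of the largest clique formed (removed node plus its
    neighbors at removal time) and the list of fill-in edges that were added.
    """
    adj = {v: set(nb) for v, nb in adj.items()}  # work on a copy
    largest, fill = 0, []
    for v in order:
        nbrs = adj.pop(v)
        largest = max(largest, len(nbrs) + 1)
        for a in nbrs:
            adj[a].discard(v)
            for b in nbrs:
                if a < b and b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill.append((a, b))
    return largest, fill


# Hypothetical example: a star with center 0 and leaves 1..4.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

leaves_first = eliminate_graph(star, [1, 2, 3, 4, 0])
center_first = eliminate_graph(star, [0, 1, 2, 3, 4])
```

Eliminating the leaves first never adds an edge, whereas eliminating the center first connects all four leaves into one clique. The size of the largest clique is what drives the size of the intermediate tables.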

### Moralization

So far we have shown how to use elimination to successively remove nodes from an undirected graph. We know that this is useful in the process of marginalization. We can now turn to the question of what will happen when we have a directed graph. It would be nice if we could somehow reduce the directed graph to an undirected form and then apply the previous elimination algorithm. This reduction is called moralization and the graph that is produced is called a moral graph.

To moralize a graph we first need to connect the parents of each node together. This makes sense intuitively because the parents of a node need to be considered together in the undirected graph and this is only done if they form a type of clique. By connecting them together we create this clique.

After the parents are connected together we can just drop the orientation on the edges in the directed graph. By removing the directions we force the graph to become undirected.

The previous elimination algorithm can now be applied to the new moral graph. We can do this by treating the conditional probability functions of the directed graph, $P(x_i|x_{\pi_i})$, as the potential functions of the undirected graph, $\psi_{c_i}(x_{c_i})$. This is valid because, after moralization, each node and its parents lie together inside a clique.

Example:
I = $\left\{x_6,x_5,x_4,x_3,x_2,x_1\right\}$
When we moralize the directed graph (Fig. \ref{fig:Moral1}), then it becomes the undirected graph (Fig. \ref{fig:Moral2}).

Fig.XX Original Directed Graph
Fig.XX Moral Undirected Graph
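The two moralization steps, connecting the parents of each node and dropping edge directions, can be sketched as follows. The function name `moralize` and the parent-list representation are illustrative choices; the example uses the parent sets implied by the factorization $P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_3)P(x_6|x_2,x_5)$ used earlier.

```python
from itertools import combinations


def moralize(parents):
    """Return the moral graph as an adjacency dict.

    `parents` maps each node to the list of its parents in the directed graph.
    """
    adj = {v: set() for v in parents}
    for child, pa in parents.items():
        for p in pa:                       # drop the edge directions
            adj[child].add(p)
            adj[p].add(child)
        for a, b in combinations(pa, 2):   # connect ("marry") the parents
            adj[a].add(b)
            adj[b].add(a)
    return adj


parents = {1: [], 2: [1], 3: [1], 4: [2], 5: [3], 6: [2, 5]}
moral = moralize(parents)
```

In this example the only new edge is the one between nodes 2 and 5, the two parents of node 6.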

### Sum Product Algorithm

One of the main disadvantages to the elimination algorithm is that the ordering of the nodes defines the number of calculations that are required to produce a result. The optimal ordering is difficult to calculate and without a decent ordering the algorithm may become very slow. In response to this we can introduce the sum product algorithm. It has one major advantage over the elimination algorithm: it is faster. The sum product algorithm has the same complexity when it has to compute the probability of one node as it does to compute the probability of all the nodes in the graph. Unfortunately, the sum product algorithm also has one disadvantage. Unlike the elimination algorithm it can not be used on any graph. The sum product algorithm works only on trees.

For undirected graphs, if there is exactly one path between any pair of nodes then the graph is a tree (Fig. \ref{fig:UnDirTree}). If we have a directed graph then we must moralize it first. If the moral graph is a tree then the directed graph is also considered a tree (Fig. \ref{fig:DirTree}).

Fig.XX Undirected tree
Fig.XX Directed tree

For the undirected graph $G(v, \varepsilon)$ (Fig. \ref{fig:UnDirTree}) we can write the joint probability distribution function in the following way.

$P(x_v) = \frac{1}{Z(\psi)}\prod_{i \in v}\psi(x_i)\prod_{(i,j) \in \varepsilon}\psi(x_i, x_j)$

We know that in general we can not convert a directed graph into an undirected graph. There is however an exception to this rule when it comes to trees. In the case of a directed tree there is an algorithm that allows us to convert it to an undirected tree with the same properties.
Take the above example (Fig. \ref{fig:DirTree}) of a directed tree. We can write the joint probability distribution function as:

$P(x_v) = P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2)$

If we want to convert this graph to the undirected form shown in (Fig. \ref{fig:UnDirTree}) then we can use the following set of rules.

• If $\gamma$ is the root then: $\psi(x_\gamma) = P(x_\gamma)$.
• If $\gamma$ is NOT the root then: $\psi(x_\gamma) = 1$.
• If $\left\lbrace i \right\rbrace = \pi_j$ then: $\psi(x_i,x_j) = P(x_j | x_i)$.

So now we can rewrite the above equation for (Fig. \ref{fig:DirTree}) as:

$P(x_v) = \frac{1}{Z(\psi)}\psi(x_1)...\psi(x_5)\psi(x_1, x_2)\psi(x_1, x_3)\psi(x_2, x_4)\psi(x_2, x_5)$
$= \frac{1}{Z(\psi)}P(x_1)P(x_2|x_1)P(x_3|x_1)P(x_4|x_2)P(x_5|x_2)$

#### Elimination Algorithm on a Tree

Fig.XX Message-passing in Elimination Algorithm

We will derive the Sum-Product algorithm from the point of view of the Eliminate algorithm. To compute the marginal of $x_1$ in (Fig. \ref{fig:TreeStdEx}),

$\begin{matrix} p(x_1)&=&\sum_{x_2}\sum_{x_3}\sum_{x_4}\sum_{x_5}p(x_1)p(x_2|x_1)p(x_3|x_2)p(x_4|x_2)p(x_5|x_3) \\ &=&p(x_1)\sum_{x_2}p(x_2|x_1)\sum_{x_3}p(x_3|x_2)\sum_{x_4}p(x_4|x_2)\underbrace{\sum_{x_5}p(x_5|x_3)}_{m_5(x_3)} \\ &=&p(x_1)\sum_{x_2}p(x_2|x_1)\underbrace{\sum_{x_3}p(x_3|x_2)m_5(x_3)}_{m_3(x_2)}\underbrace{\sum_{x_4}p(x_4|x_2)}_{m_4(x_2)} \\ &=&p(x_1)\underbrace{\sum_{x_2}p(x_2|x_1)m_3(x_2)m_4(x_2)}_{m_2(x_1)} \\ &=&p(x_1)m_2(x_1) \end{matrix}$

where,

$\begin{matrix} m_5(x_3)=\sum_{x_5}p(x_5|x_3)=\sum_{x_5}\psi(x_5)\psi(x_5,x_3)=\mathbf{m_{53}(x_3)} \\ m_4(x_2)=\sum_{x_4}p(x_4|x_2)=\sum_{x_4}\psi(x_4)\psi(x_4,x_2)=\mathbf{m_{42}(x_2)} \\ m_3(x_2)=\sum_{x_3}p(x_3|x_2)m_5(x_3)=\sum_{x_3}\psi(x_3)\psi(x_3,x_2)m_5(x_3)=\mathbf{m_{32}(x_2)}, \end{matrix}$

which is essentially (potential of the node)$\times$(potential of the edge)$\times$(message from the child).

The term $m_{ji}(x_i)$ represents the intermediate factor between the eliminated variable, $j$, and the remaining neighbor of the variable, $i$. Thus, in the above case, we will use $m_{53}(x_3)$ to denote $m_5(x_3)$, $m_{42}(x_2)$ to denote $m_4(x_2)$, and $m_{32}(x_2)$ to denote $m_3(x_2)$. We refer to the intermediate factor $m_{ji}(x_i)$ as a "message" that $j$ sends to $i$. (Fig. \ref{fig:TreeStdEx})

In general,
$\begin{matrix} m_{ji}(x_i)=\sum_{x_j}\left( \psi(x_j)\psi(x_j,x_i)\prod_{k\in{\mathcal{N}(j)\backslash i}}m_{kj}(x_j)\right) \end{matrix}$

#### Elimination To Sum Product Algorithm

Fig.XX All of the messages needed to compute all singleton marginals

The Sum-Product algorithm allows us to compute all marginals in the tree by passing messages inward from the leaves of the tree to an (arbitrary) root, and then passing it outward from the root to the leaves, again using (\ref{equ:MsgEquation}) at each step. The net effect is that a single message will flow in both directions along each edge. (See Figure \ref{fig:SumProdEx}) Once all such messages have been computed using (\ref{equ:MsgEquation}), we can compute desired marginals.

As shown in Figure \ref{fig:SumProdEx}, to compute the marginal of $X_1$ using elimination, we eliminate $X_5$, which involves computing a message $m_{53}(x_3)$, then eliminate $X_4$ and $X_3$, which involves messages $m_{32}(x_2)$ and $m_{42}(x_2)$. We subsequently eliminate $X_2$, which creates a message $m_{21}(x_1)$.

Suppose that we want to compute the marginal of $X_2$. As shown in Figure \ref{fig:MsgsFormed}, we first eliminate $X_5$, which creates $m_{53}(x_3)$, and then eliminate $X_3$, $X_4$, and $X_1$, passing messages $m_{32}(x_2)$, $m_{42}(x_2)$ and $m_{12}(x_2)$ to $X_2$.

Fig.XX The messages formed when computing the marginal of X2

Since the messages can be "reused", the marginals of all nodes can be obtained by computing every possible message once. The number of messages (two per edge) is small compared to the number of possible elimination orderings.

The Sum-Product algorithm is based not only on equation (\ref{equ:MsgEquation}), but also on the Message-Passing Protocol. The Message-Passing Protocol tells us that *a node can send a message to a neighboring node when (and only when) it has received messages from all of its other neighbors*.
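A minimal sketch of the message equation and the message-passing protocol on the five-node tree used above (edges 1-2, 2-3, 2-4, 3-5) is given below. The potentials are hypothetical, all variables are assumed binary, and the recursion in `message` is one possible way to respect the protocol: a message from j to i is computed only after all messages into j from its other neighbors exist.

```python
K = (0, 1)  # every variable is assumed binary
edges = [(1, 2), (2, 3), (2, 4), (3, 5)]
nbrs = {i: set() for i in range(1, 6)}
for i, j in edges:
    nbrs[i].add(j)
    nbrs[j].add(i)

# Hypothetical potentials, chosen only for illustration.
node_pot = {i: {0: 1.0, 1: 1.0 + 0.5 * i} for i in nbrs}


def edge_pot(xi, xj):
    """Symmetric edge potential that favors agreement."""
    return 2.0 if xi == xj else 1.0


messages = {}  # (j, i) -> {x_i: m_ji(x_i)}, each message is computed once


def message(j, i):
    """m_ji(x_i) = sum_xj psi(x_j) psi(x_j, x_i) prod_{k in N(j) minus i} m_kj(x_j)."""
    if (j, i) not in messages:
        incoming = [message(k, j) for k in nbrs[j] - {i}]
        msg = {}
        for xi in K:
            total = 0.0
            for xj in K:
                term = node_pot[j][xj] * edge_pot(xi, xj)
                for mk in incoming:
                    term *= mk[xj]
                total += term
            msg[xi] = total
        messages[(j, i)] = msg
    return messages[(j, i)]


def marginal(i):
    """Normalized product of the node potential and all incoming messages."""
    unnorm = {}
    for xi in K:
        val = node_pot[i][xi]
        for k in nbrs[i]:
            val *= message(k, i)[xi]
        unnorm[xi] = val
    z = sum(unnorm.values())
    return {xi: v / z for xi, v in unnorm.items()}


marginals = {i: marginal(i) for i in nbrs}
```

After all five marginals have been requested, `messages` holds exactly two messages per edge, one in each direction, which is the reuse that distinguishes sum-product from running elimination once per query node.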

#### For Directed Graph

Previously we stated that:

$p(x_F,\bar{x}_E)=\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E),$

Using the above equation (\ref{eqn:Marginal}), we find the marginal of $\bar{x}_E$.

$\begin{matrix} p(\bar{x}_E)&=&\sum_{x_F}\sum_{x_E}p(x_F,x_E)\delta(x_E,\bar{x}_E) \\ &=&\sum_{x_v}p(x_F,x_E)\delta (x_E,\bar{x}_E) \end{matrix}$

Now we denote:

$p^E(x_v) = p(x_v) \delta (x_E,\bar{x}_E)$

Since the sets $F$ and $E$ together make up $\mathcal{V}$, $p(x_v)$ is equal to $p(x_F,x_E)$. Thus we can substitute the equation (\ref{eqn:Dir8}) into (\ref{eqn:Marginal}) and (\ref{eqn:Dir7}), and they become:

$\begin{matrix} p(x_F,\bar{x}_E) = \sum_{x_E} p^E(x_v), \\ p(\bar{x}_E) = \sum_{x_v}p^E(x_v) \end{matrix}$

We are interested in finding the conditional probability. We substitute previous results, (\ref{eqn:Dir9}) and (\ref{eqn:Dir10}) into the conditional probability equation.

$\begin{matrix} p(x_F|\bar{x}_E)&=&\frac{p(x_F,\bar{x}_E)}{p(\bar{x}_E)} \\ &=&\frac{\sum_{x_E}p^E(x_v)}{\sum_{x_v}p^E(x_v)} \end{matrix}$

$p^E(x_v)$ is an unnormalized version of the conditional probability $p(x_F|\bar{x}_E)$.

#### For Undirected Graphs

We denote $\psi^E$ to be:

$\begin{matrix} \psi^E(x_i) = \psi(x_i)\delta(x_i,\bar{x}_i),& & \text{if } i\in{E} \\ \psi^E(x_i) = \psi(x_i),& & \text{otherwise} \end{matrix}$

### Max-Product

We would like to find the maximum probability that can be achieved by any configuration of a set of random variables. The algorithm is similar to sum-product, except that we replace each sum with a max.

Fig.XX Max Product Example
$\begin{matrix} \max_{x_v}{P(x_v)} & = & \max_{x_1}\max_{x_2}\max_{x_3}\max_{x_4}\max_{x_5}{P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_2)P(x_5|x_3)} \\ & = & \max_{x_1}{P(x_1)}\max_{x_2}{P(x_2|x_1)}\max_{x_3}{P(x_3|x_2)}\max_{x_4}{P(x_4|x_2)}\max_{x_5}{P(x_5|x_3)} \end{matrix}$

Here each max applies to everything to its right. To handle evidence, as in $p(x_F|\bar{x}_E)$, the messages use the evidence potentials $\psi^E$. The sum-product message and its max-product counterpart are:

$m_{ji}(x_i)=\sum_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m_{kj}(x_j)}$
$m^{max}_{ji}(x_i)=\max_{x_j}{\psi^{E}{(x_j)}\psi{(x_i,x_j)}\prod_{k\in{N(j)\backslash{i}}}m^{max}_{kj}(x_j)}$

Example: Consider the graph in Figure \ref{fig:MaxProdEx}.

$m^{max}_{53}(x_3)=\max_{x_5}{\psi^{E}{(x_5)}\psi{(x_3,x_5)}}$
$m^{max}_{32}(x_2)=\max_{x_3}{\psi^{E}{(x_3)}\psi{(x_2,x_3)}m^{max}_{53}(x_3)}$

### Maximum configuration

We would also like to find the values of the $x_i$ that produce the largest value for the given expression. To do this, alongside each max from the previous section we also record the argmax. For example, for each value of $x_3$ we store
$\hat{x}_5(x_3)= \arg\max_{x_5}\psi{(x_5)}\psi{(x_5,x_3)}$
$\log{m^{max}_{ji}(x_i)}=\max_{x_j}\left[\log{\psi^{E}{(x_j)}}+\log{\psi{(x_i,x_j)}}+\sum_{k\in{N(j)\backslash{i}}}\log{m^{max}_{kj}{(x_j)}}\right]$
In many cases we want to use the log of this expression, because a product of many probabilities becomes very small and can underflow. It is also important to note that the sum-product algorithm works in the continuous case as well, where we replace the summation sign with an integral.
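The following sketch combines the max-product message in log form with the recorded argmax values to recover a maximizing configuration. It assumes a rooted version of the five-node tree used earlier (node 1 as root), binary variables, no evidence, and hypothetical strictly positive potentials so that every logarithm is defined. The names `log_message` and `best_value` are illustrative.

```python
import math

K = (0, 1)  # every variable is assumed binary
children = {1: [2], 2: [3, 4], 3: [5], 4: [], 5: []}  # tree rooted at node 1

# Hypothetical strictly positive potentials.
node_pot = {i: {0: 1.0, 1: 0.5 + 0.4 * i} for i in children}


def edge_pot(xi, xj):
    return 2.0 if xi == xj else 1.0


best_value = {}  # (j, x_parent) -> maximizing value of x_j


def log_message(j):
    """log of the max-product message from j to its parent, as {x_parent: value}."""
    child_msgs = [log_message(k) for k in children[j]]
    msg = {}
    for xp in K:
        scores = {}
        for xj in K:
            s = math.log(node_pot[j][xj]) + math.log(edge_pot(xp, xj))
            s += sum(cm[xj] for cm in child_msgs)
            scores[xj] = s
        best = max(scores, key=scores.get)
        best_value[(j, xp)] = best   # record the argmax for backtracking
        msg[xp] = scores[best]
    return msg


# Inward pass: collect messages at the root.
root = 1
root_msgs = [log_message(k) for k in children[root]]
root_scores = {x: math.log(node_pot[root][x]) + sum(m[x] for m in root_msgs)
               for x in K}

# Outward pass: read off the recorded argmax values from the root down.
config = {root: max(root_scores, key=root_scores.get)}
stack = [root]
while stack:
    i = stack.pop()
    for j in children[i]:
        config[j] = best_value[(j, config[i])]
        stack.append(j)

log_max = root_scores[config[root]]  # log of the largest unnormalized product
```

Working with sums of logs keeps the intermediate values in a safe numeric range, and the stored argmax table is what turns the maximum value into a maximizing configuration.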

## Basic Statistical Problems

In statistics there are a number of different 'standard' problems that always appear in one form or another. They are as follows:

• Regression
• Classification
• Clustering
• Density Estimation

### Regression

In regression we have a set of data points $(x_i,y_i)$ for $i = 1, \ldots, n$ and we would like to determine the way that the variables x and y are related. In certain cases such as (Fig. \ref{img:regression.eps}) we try to fit a line (or other type of function) through the points in such a way that it describes the relationship between the two variables.

Fig.XX Regression

Once the relationship has been determined we can give a functional value to the following expression. In this way we can determine the value (or distribution) of y if we have the value for x. $P(y|x)=\frac{P(y,x)}{P(x)} = \frac{P(y,x)}{\int_{y}{P(y,x)dy}}$

### Classification

In classification we also have a set of points $(x_i,y_i)$ for $i = 1, \ldots, n$, but we would like to use the x and y values to determine if a certain point belongs in group A or in group B. Consider the example in (Fig. \ref{img:Classification.eps}) where two sets of points have been divided into the set + and the set - by a line. The purpose of classification is to find this line and then place any new points into one group or the other.

Fig.XX Classify Points into Two Sets

We would like to obtain the probability distribution given by the following equation, where c is the class and x and y are the data points. In simple terms, we would like to find the probability that a point is in class c when we know that the values of X and Y are x and y.

$P(c|x,y)=\frac{P(c,x,y)}{P(x,y)} = \frac{P(c,x,y)}{\sum_{c}{P(c,x,y)}}$

### Clustering

Clustering is somewhat like classification only that we do not know the groups before we gather and examine the data. We would like to find the probability distribution of the following equation without knowing the value of y.

$P(y|x)=\frac{P(y,x)}{P(x)}\ \ y\ unknown$

We can use graphs to represent the three types of statistical problems that have been introduced so far. The first graph (Fig. \ref{fig:RegClass}) can be used to represent either the Regression or the Classification problem because both the X and the Y variables are known. In the second graph (Fig. \ref{fig:Clustering}) the value of the Y variable is unknown, so we can tell that this graph represents the Clustering situation.

Fig.XX Regression or classification
Fig.XX Clustering

Classification example: Naive Bayes classifier
First define a set of boolean random variables $X_i$ and $Y$ for $i = 1, \ldots, n$.

$Y \in \left\{1,0\right\}, X_i \in \left\{1,0\right\}$

Then we will say that a certain pattern of Xs can either be classified as a 1 or a 0. The result of this classification will be represented by the variable Y. The graphical representation is shown in (Fig. \ref{img:classifi.eps}). One important thing to note here is that the two diagrams represent the same graph. The one on the right uses plate notation to simplify the representation of the graph for variables that are indexed. Such plate notation will also be used later in these notes.

| $x$ ($n$ bits) | | $Y$ |
|---|---|---|
| $\langle 01110 \rangle$ | $\rightarrow$ | 1 |
| $\langle 01110 \rangle$ | $\rightarrow$ | 0 |

Fig.XX Two Types of Graphical Representation

We are interested in finding the following:

$\begin{matrix} P(y|x_1,\ldots,x_n)=\frac{P(x_1,\ldots,x_n|y)P(y)}{P(x_1,\ldots,x_n)} = \frac{P(x_1,\ldots,x_n,y)}{P(x_1,\ldots,x_n)} = \frac{P(y)\prod_{i=1}^{n}{P(x_i|y)}}{P(x_1,\ldots,x_n)} \end{matrix}$

The classification is very intuitive in this case. We will calculate the probability that we are in class 1 and we will calculate the probability that we are in class 0. The higher probability will decide the class. For example if we have a higher probability of being in class 1 then we will place this set of Xs in class 1.

$\begin{matrix} \widehat{y}=1 & \Leftrightarrow & P(y = 1 | x_1,\ldots,x_n) > P(y = 0 | x_1,\ldots,x_n) \\ & \Leftrightarrow & \frac{P(y=1|x_1,\ldots,x_n)}{P(y=0|x_1,\ldots,x_n)} >1 \\ & \Leftrightarrow & \log{\frac{P(y=1)}{P(y=0)}} + \sum_{i=1}^{n}{\log{\frac{P(x_i|y=1)}{P(x_i|y=0)}}}>0 \end{matrix}$

Now if we define the following:
$P(y = 1) = p$
$P(x_i = 1 | y = 1) = P_{i1}$
$P(x_i = 1 | y = 0) = P_{i0}$

so that $P(x_i|y=1) = P_{i1}^{x_i}(1-P_{i1})^{1-x_i}$, and similarly for $y=0$. We can continue with the above simplification and we arrive at the solution:
$\begin{matrix} \widehat{y}=1 & \Leftrightarrow & \log{\frac{p}{1-p}} + \sum_{i=1}^{n}\left[x_i\log{\frac{P_{i1}}{P_{i0}}}+ (1-x_i)\log{\frac{(1-P_{i1})}{(1-P_{i0})}}\right] > 0 \\ & \Leftrightarrow & \sum_{i=1}^{n} x_i\underbrace{\log{\frac{P_{i1}(1-P_{i0})}{P_{i0}(1-P_{i1})}}}_{slope} + \underbrace{ \log{\frac{p}{1-p}} + \sum_{i=1}^{n}\log{\frac{(1-P_{i1})}{(1-P_{i0})}} }_{intercept} > 0 \end{matrix}$
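The decision rule can be sketched in code by summing the per-feature slope terms and adding the intercept, which includes the prior log-odds from the log-ratio test above. The function names and all numeric parameters below are hypothetical; `posterior_y1` computes the same quantity directly from Bayes' rule as a cross-check.

```python
import math


def log_odds(x, prior, p1, p0):
    """log P(y=1 | x) - log P(y=0 | x) for binary features under naive Bayes.

    prior = P(y=1), p1[i] = P(x_i=1 | y=1), p0[i] = P(x_i=1 | y=0).
    """
    slope = [math.log(a * (1 - b) / (b * (1 - a))) for a, b in zip(p1, p0)]
    intercept = math.log(prior / (1 - prior)) + sum(
        math.log((1 - a) / (1 - b)) for a, b in zip(p1, p0))
    return intercept + sum(w * xi for w, xi in zip(slope, x))


def posterior_y1(x, prior, p1, p0):
    """P(y=1 | x) computed directly from Bayes' rule, for comparison."""
    def likelihood(ps):
        out = 1.0
        for xi, p in zip(x, ps):
            out *= p if xi == 1 else 1 - p
        return out
    num = prior * likelihood(p1)
    return num / (num + (1 - prior) * likelihood(p0))


# Hypothetical parameters for a pattern of n = 5 binary features.
prior = 0.5
p1 = [0.8, 0.7, 0.9, 0.6, 0.2]
p0 = [0.3, 0.4, 0.2, 0.5, 0.6]
x = [0, 1, 1, 1, 0]

score = log_odds(x, prior, p1, p0)
y_hat = 1 if score > 0 else 0
```

Because the score is linear in the $x_i$, the naive Bayes classifier for binary features is a linear classifier whose weights are the slopes.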

## Example from last class

John is not a professional trader, but he trades in the copper market. Copper stock increases if demand for copper exceeds supply, and decreases if supply exceeds demand. Given supply and demand, the price of copper stock is still not completely determined, because unknown factors may affect the market, such as predictions about the political stability of the countries that supply copper, or news about a potential new use of copper.

If copper stock increases and John chooses the right strategy, he will win; otherwise he will lose. Since John is not a professional trader, he sometimes uses a bad trading strategy and loses even though the stock price increases.

S: A discrete variable which represents an increase or decrease in copper supply.

D: A discrete variable which represents an increase or decrease in copper demand.

C: A discrete variable which represents an increase or decrease in stock price.

P: A discrete variable that shows whether John wins or loses in his trade.

J: A discrete variable which is 1 when John makes the right choice in his trade strategy and 0 otherwise.

Fig.XX

$p(S=1)=0.6$, $p(D=1)=0.7$, $p(J=1)=0.4$

| S | D | $p(C=1 \mid S, D)$ |
|---|---|---|
| 1 | 1 | 0.5 |
| 1 | 0 | 0.1 |
| 0 | 1 | 0.85 |
| 0 | 0 | 0.5 |

| J | C | $p(P=1 \mid J, C)$ |
|---|---|---|
| 1 | 1 | 0.85 |
| 1 | 0 | 0.5 |
| 0 | 1 | 0.2 |
| 0 | 0 | 0.1 |

$p(S,D,C,J,P) = p(S)p(D)p(J)p(C|S,D)p(P|J,C)$
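With the factorization and the tables above, any marginal or conditional can be obtained by summing the joint. The sketch below uses brute-force enumeration over the five binary variables; the helper names `bern` and `joint` are illustrative, and the numbers are those given in the tables.

```python
from itertools import product

p_S, p_D, p_J = 0.6, 0.7, 0.4                                   # P(S=1), P(D=1), P(J=1)
p_C1 = {(1, 1): 0.5, (1, 0): 0.1, (0, 1): 0.85, (0, 0): 0.5}    # P(C=1 | S, D)
p_P1 = {(1, 1): 0.85, (1, 0): 0.5, (0, 1): 0.2, (0, 0): 0.1}    # P(P=1 | J, C)


def bern(p1, v):
    """Probability of value v for a binary variable with P(v=1) = p1."""
    return p1 if v == 1 else 1 - p1


def joint(s, d, c, j, p):
    """p(S, D, C, J, P) = p(S) p(D) p(J) p(C | S, D) p(P | J, C)."""
    return (bern(p_S, s) * bern(p_D, d) * bern(p_J, j)
            * bern(p_C1[(s, d)], c) * bern(p_P1[(j, c)], p))


# Marginalization: probability that John wins.
p_win = sum(joint(s, d, c, j, 1) for s, d, c, j in product((0, 1), repeat=4))

# Conditioning: probability that John chose the right strategy given that he won.
p_win_and_right = sum(joint(s, d, c, 1, 1) for s, d, c in product((0, 1), repeat=3))
p_right_given_win = p_win_and_right / p_win
```

For these tables the marginal `p_win` works out to 0.3652. Enumeration is feasible here because there are only $2^5$ configurations; the elimination algorithm gives the same answer without building the full joint.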

### Bayesian and Frequentist Statistics

There are two approaches of parameter estimation: the Bayesian and the Frequentist. This section focuses on the distinctions between these two approaches. We begin with a simple example,
Example:
Consider the following table of 1s and 2s. We would like to teach the computer to distinguish between the two sets of numbers so that when a person writes down a number the computer can use a statistical tool to decide if the written digit is a 1 or a 2.

| θ | 1 | 2 |
|---|---|---|
| X | 1 | 2 |
| X | 1 | 2 |
| X | 1 | 2 |

The question that arises is: Given a written number what is the probability that that number belongs to the group of ones and what is the probability that that number belongs to the group of twos. In the Frequentist approach we use p(x | θ). We view the model p(x | θ) as a conditional probability distribution. Here, θ is known and X is unknown. However, Bayesian approach views X as known and θ as unknown, which gives

$p(\theta|x) = \frac {p(x|\theta)p(\theta)}{p(x)}$

Where p(θ | x) is the posterior probability , p(x | θ) is likelihood, and p(θ) is the prior probability of the parameter. There are some important assumptions about this equation. First, we view θ as a random variable. This is characteristic of the Bayesian approach, which is that all unknown quantities are treated as random variables. Second, we view the data x as a quantity to be conditioned on. Our inference is conditional on the event {X = x}. Third, in order to calculate p(θ | x) we need p(θ). Finally, note that Bayes rule yields a distribution over θ, not a single estimate of θ.

The Frequentist approach tries to avoid the use of prior probabilities. The goal of Frequentist methodology is to develop an "objective" statistical theory, in which two statisticians employing the methodology must necessarily draw the same conclusions from a particular set of data.

Consider a coin-tossing experiment as an example. The model is the Bernoulli distribution, $p(x | \theta) = \theta^{x}(1 - \theta)^{1 - x}$. The Bayesian approach requires us to assign a prior probability to θ before observing the outcome of tossing the coin. Different conclusions may be obtained from the experiment if different priors are assigned to θ. The Frequentist statistician wishes to avoid such "subjectivity". From another point of view, a Frequentist may claim that θ is a fixed property of the coin, and that it makes no sense to assign a probability to it. A Bayesian would believe that p(θ | x) represents the statistician's uncertainty about the value of θ. Bayesian statistics views the posterior probability and the prior probability alike as subjective.

### Maximum Likelihood Estimator

There is one particular estimator that is widely used in Frequentist statistics, namely the maximum likelihood estimator. Recall that the probability model p(x | θ) has the intuitive interpretation of assigning probability to X for each fixed value of θ. In the Bayesian approach this intuition is formalized by treating p(x | θ) as a conditional probability distribution. In the Frequentist approach, however, we treat p(x | θ) as a function of θ for fixed x, and refer to p(x | θ) as the likelihood function.

$\hat{\theta}_{ML}=argmax_{\theta}p(x|\theta)$

where p(x | θ) is the likelihood L(θ,x), or equivalently

$\hat{\theta}_{ML}=argmax_{\theta}log(p(x|\theta))$

where log(p(x | θ)) is the log likelihood l(θ,x).

Since p(x) in the denominator of Bayes Rule is independent of θ we can consider it as a constant and we can draw the conclusion that:

$p(\theta|x) \propto p(x|\theta)p(\theta)$

Symbolically, we can interpret this as follows:

$Posterior \propto likelihood \times prior$

where we see that in the Bayesian approach the likelihood can be viewed as a data-dependent operator that transforms between the prior probability and the posterior probability.

### Connection between Bayesian and Frequentist Statistics

Suppose in particular that we force the Bayesian to choose a particular value of θ; that is, to reduce the posterior distribution p(θ | x) to a point estimate. Various possibilities present themselves; in particular one could choose the mean of the posterior distribution or perhaps the mode.

(i) the mean of the posterior (expectation):

$\hat{\theta}_{Bayes}=\int \theta p(\theta|x)\,d\theta$

is called Bayes estimate.

OR

(ii) the mode of posterior:

$\begin{matrix} \hat{\theta}_{MAP}&=&argmax_{\theta} p(\theta|x) \\ &=&argmax_{\theta}p(x|\theta)p(\theta) \end{matrix}$

Note that MAP stands for "maximum a posteriori".

When the prior p(θ) is taken to be uniform on θ, the MAP estimate reduces to the maximum likelihood estimate $\hat{\theta}_{ML}$:

$\hat{\theta}_{MAP} = argmax_{\theta}p(x|\theta)p(\theta) = argmax_{\theta}p(x|\theta) = \hat{\theta}_{ML}$

When the prior is not taken to be uniform, the MAP estimate maximizes the product of the likelihood and the prior. Because the logarithm is a monotonic function, taking the log of this product does not alter the optimizing value.

Thus, one has:

$\hat{\theta}_{MAP}=argmax_{\theta} \{ log p(x|\theta) + log p(\theta) \}$

as an alternative expression for the MAP estimate.

Here, log(p(x | θ)) is log likelihood and the "penalty" is the additive term log(p(θ)). Penalized log likelihoods are widely used in Frequentist statistics to improve on maximum likelihood estimates in small sample settings.

#### Information for an Event

Consider a given event E with probability P(E). The less probable the event is, the more information we gain when we observe it. We calculate the information as:

$Information = log (\frac{1}{P(E)}) = - log (P(E))$

#### Binomial Example

Probability Example:
Consider the set of observations $x = (x_1, x_2, \cdots, x_n)$ which are iid, where $x_1, x_2, \cdots, x_n$ are the different observations of X. We can also say that this random variable is parameterized by a θ such that:

$P(X|\theta) \equiv P_{\theta}(x)$

In our example we will use the following model:

$P(x_i = 1) = \theta$
$P(x_i = 0) = 1 - \theta$
$P(x_i|\theta) = \theta^{x_i}(1-\theta)^{(1-x_i)}$
where
$x_i \in \lbrace 0,1 \rbrace$

Suppose now that we also have some data D:
e.g. $D = \left\lbrace 1,1,0,1,0,0,0,1,1,1,1,\cdots,0,1,0 \right\rbrace$
We want to use this data to estimate θ.

We would now like to use the ML technique. To do this we can construct the following graphical model:

Fig.XX

Fig.XX

Since all of the variables are iid then there are no dependencies between the variables and so we have no edges from one node to another.

Fig.XX

How do we find the joint probability distribution function for these variables? Well since they are all independent we can just multiply the marginal probabilities and we get the joint probability.

$L(\theta;x) = \prod_{i=1}^n P(x_i|\theta)$

This is in fact the likelihood that we want to work with. Now let us try to maximise it:

$\begin{matrix} l(\theta;x) & = & log(\prod_{i=1}^n P(x_i|\theta)) \\ & = & \sum_{i=1}^n log(P(x_i|\theta)) \\ & = & \sum_{i=1}^n log(\theta^{x_i}(1-\theta)^{1-x_i}) \\ & = & \sum_{i=1}^n x_ilog(\theta) + \sum_{i=1}^n (1-x_i)log(1-\theta) \\ \end{matrix}$

Take the derivative and set it to zero:

$\frac{\partial l}{\partial\theta} = 0$
$\frac{\partial l}{\partial\theta} = \sum_{i=1}^{n}\frac{x_i}{\theta} - \sum_{i=1}^{n}\frac{1-x_i}{1-\theta} = 0$
$\Rightarrow \frac{\sum_{i=1}^{n}x_i}{\theta} = \frac{\sum_{i=1}^{n}(1-x_i)}{1-\theta}$
$\frac{H}{\theta} = \frac{T}{1-\theta}$

Where:

H = number of $x_i = 1$, i.e. the number of heads
T = number of $x_i = 0$, i.e. the number of tails

Hence, T + H = n.

And now we can solve for θ:

$\begin{matrix} \theta & = & \frac{(1-\theta)H}{T} \\ \theta + \theta\frac{H}{T} & = & \frac{H}{T} \\ \theta(\frac{T+H}{T}) & = & \frac{H}{T} \\ \theta & = & \frac{\frac{H}{T}}{\frac{n}{T}} = \frac{H}{n} \end{matrix}$
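The closed-form result $\hat{\theta} = H/n$ can be checked numerically. The sketch below is only an illustration: the data list is hypothetical (the data set D in the text is elided), and the function names are made up for this example. It compares the closed form against a brute-force search of the log likelihood over a grid.

```python
import math

def bernoulli_log_likelihood(theta, data):
    """l(theta; x) = sum_i x_i log(theta) + (1 - x_i) log(1 - theta)."""
    heads = sum(data)
    tails = len(data) - heads
    return heads * math.log(theta) + tails * math.log(1.0 - theta)

def bernoulli_mle(data):
    """Closed-form ML estimate: H / n."""
    return sum(data) / len(data)

# Hypothetical observations (1 = head, 0 = tail), not the D from the text.
data = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]
theta_hat = bernoulli_mle(data)

# Brute-force check: evaluate the log likelihood on a grid in (0, 1)
# and keep the grid point with the largest value.
grid = [k / 1000 for k in range(1, 1000)]
best_on_grid = max(grid, key=lambda t: bernoulli_log_likelihood(t, data))

print(theta_hat, best_on_grid)
```

Both numbers should agree up to the grid resolution, which is what the derivation predicts.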

#### Univariate Normal

Now let us assume that the observed values come from a normal distribution.

Fig.XX

Our new model looks like:

$P(x_i|\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}}$

Now to find the likelihood we once again multiply the independent marginal probabilities to obtain the joint probability and the likelihood function.

$L(\theta;x) = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}}$
$\max_{\theta}l(\theta;x) = \max_{\theta}\sum_{i=1}^{n}\left(-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^{2}+log\frac{1}{\sqrt{2\pi}\sigma}\right)$

Now, since our parameter theta is in fact a set of two parameters,

θ = (μ,σ)

we must estimate each of the parameters separately.

$\frac{\partial l}{\partial \mu} = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \Rightarrow \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i$
$\frac{\partial l}{\partial \sigma ^{2}} = \frac{1}{2\sigma ^4} \sum _{i=1}^{n}(x_i-\mu)^2 - \frac{n}{2} \frac{1}{\sigma ^2} = 0$
$\Rightarrow \hat{\sigma} ^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2$
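As a rough check of these two estimators, the following sketch computes the sample mean and the variance that divides by n, then applies them to simulated data. The true parameter values, the sample size and the seed are arbitrary choices made for this illustration.

```python
import math
import random

def gaussian_mle(xs):
    """Return (mu_hat, var_hat) for iid Gaussian data."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # The ML variance estimate divides by n, not n - 1.
    var_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, var_hat

# Simulated data with assumed "true" parameters.
rng = random.Random(0)
true_mu, true_sigma = 2.0, 1.5
sample = [rng.gauss(true_mu, true_sigma) for _ in range(20000)]

mu_hat, var_hat = gaussian_mle(sample)
print(mu_hat, math.sqrt(var_hat))
```

With this many samples the estimates should land close to the assumed true values, although they will not match exactly.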

#### Bayesian

Now we can take a look at the Bayesian approach to the same problem. Assume θ is a random variable, and we want to find P(θ | x). Also, assume θ is the mean and variance of a Gaussian distribution like in the previous example.

The graphical model is shown in the figure below.

Fig.XX Graphical Model for Mean
$P(\mu | x) = \frac{P(x|\mu)P(\mu)}{P(x)}$

We can begin with the estimation of μ. If we assume a uniform prior on μ, the result matches the one from ML estimation. But if we assume that μ is normally distributed, we get an interesting result.

Assume μ is normal and that σ is known, then

$\mu \thicksim N(\mu _{0}, \tau^2)$
$P(x, \mu) = P(\mu)\prod_{i=1}^{n}P(x_i|\mu)$

We want to find P(μ | x) and take its expectation.

$P(\mu | x) = \frac{1}{\sqrt{2\pi}\hat{\sigma}}e^{-\frac{1}{2}(\frac{\mu-\hat{\mu}}{\hat{\sigma}})^2}$

Where

$\hat{\mu} = \frac{\frac{n}{\sigma ^ 2}}{\frac{n}{\sigma ^ 2} + \frac{1}{\tau ^ 2}}\bar{x} + \frac{\frac{1}{\tau ^ 2}}{\frac{n}{\sigma ^2} + \frac{1}{\tau ^2}}\mu _0$

is a linear combination of the sample mean $\bar{x}$ and the mean of the prior $\mu_0$. As the number of observations grows, the data dominate the prior:

$\lim_{n \rightarrow \infty}\hat{\mu} = \bar{x} = \frac{\sum_{i=1}^{n}x_i}{n}$

P(μ | x) gives a distribution over μ, not just a single value. If we carry out the same calculation for the posterior variance we find the following result:

$(\hat{\sigma})^{2} = (\frac{n}{\sigma ^{2}} + \frac{1}{\tau^{2}})^{-1}$
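The posterior mean and variance can be computed directly from these formulas. The sketch below assumes that σ (the data standard deviation) is known and that τ is the standard deviation of the prior on μ; the function name and the numbers are hypothetical.

```python
def gaussian_posterior(xs, sigma, mu0, tau):
    """Posterior mean and variance of mu for known sigma and prior N(mu0, tau^2)."""
    n = len(xs)
    sample_mean = sum(xs) / n
    precision = n / sigma ** 2 + 1.0 / tau ** 2
    # Weighted combination of the sample mean and the prior mean.
    post_mean = (n / sigma ** 2 * sample_mean + mu0 / tau ** 2) / precision
    post_var = 1.0 / precision
    return post_mean, post_var

# Three observations pull the estimate from the prior mean 0 towards 5.
small_mean, small_var = gaussian_posterior([4.0, 5.0, 6.0], sigma=1.0, mu0=0.0, tau=1.0)

# With many more observations the prior has almost no influence.
large_mean, large_var = gaussian_posterior([4.0, 5.0, 6.0] * 1000, sigma=1.0, mu0=0.0, tau=1.0)

print(small_mean, small_var)
print(large_mean, large_var)
```

The second call illustrates the limit above: as n grows the posterior mean approaches the sample mean and the posterior variance shrinks.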

#### ML Estimate for Completely Observed Graphical Models

For a given graph G(V, E) each node represents a random variable. We can observe these variables and write down data for each one. If for example we had n nodes in the graph, one observation would be $(x_1,x_2,...,x_n)$. We assume that separate observations are independent and identically distributed. Note that within a single observation $x_i$ is not necessarily independent of $x_j$.

Directed Graph Example
Consider the following directed graph.

Fig.XX Our Directed Graph

We can assume that we have made a number of observations, say n, for each of the random variables in this graph.
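\begin{tabular}{ccccc}
Observation & $X_1$ & $X_2$ & $X_3$ & $X_4$ \\
1 & $x_{11}$ & $x_{12}$ & $x_{13}$ & $x_{14}$ \\
2 & $x_{21}$ & $x_{22}$ & $x_{23}$ & $x_{24}$ \\
3 & $x_{31}$ & $x_{32}$ & $x_{33}$ & $x_{34}$ \\
& & ... & & \\
n & $x_{n1}$ & $x_{n2}$ & $x_{n3}$ & $x_{n4}$
\end{tabular}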

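Armed with this new information we would like to estimate $\theta = (\theta_1, \theta_2, \theta_3, \theta_4)$.
We know from before that we can write the joint distribution function as:

$P(x | \theta) = P(x_1 | \theta_1)P(x_2 | x_1, \theta_2)P(x_3 | x_1, \theta_3)P(x_4 | x_2,x_3, \theta_4)$

Which means that our likelihood function is:

$L(\theta,x) = \prod_{i=1}^{n} P(x_{i1} | \theta_1)P(x_{i2} | x_{i1},\theta_2)P(x_{i3} | x_{i1},\theta_3)P(x_{i4} | x_{i2},x_{i3},\theta_4)$

And our log likelihood is:

$l(\theta,x) = \sum_{i=1}^{n} \left[ log(P(x_{i1} | \theta_1)) + log(P(x_{i2} | x_{i1},\theta_2)) + log(P(x_{i3} | x_{i1},\theta_3)) + log(P(x_{i4} | x_{i2},x_{i3},\theta_4)) \right]$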

To maximise the likelihood over θ we must maximise over each of the $\theta_i$. The good thing is that each of our parameters appears in a different term, so the maximization of each $\theta_i$ can be carried out independently of the others.
For discrete random variables we can use the definition of conditional probability and estimate each probability by counting. For example:

$\begin{matrix} P(x_2=1|x_1=1) & = & \frac{P(x_2=1,x_1=1)}{P(x_1=1)} \\ & = & \frac{Number\ of\ times\ x_1\ and\ x_2\ are\ 1}{Number\ of\ times\ x_1\ is\ 1} \end{matrix}$

Intuitively, this means that we count the number of times that both variables satisfy their conditions and then divide by the number of times that the conditioning variable (here $x_1$) satisfies its condition. This gives the proportion of those cases in which $x_2$ satisfies its condition as well. The proportion is in fact the $\theta_i$ we are looking for.
We can consider another example. We can try to find:

$P(x_4 | x_3,x_2)$

\begin{tabular}{cccc}
$x_3$ & $x_2$ & $P(x_4 = 0 | x_3,x_2)$ & $P(x_4 = 1 | x_3,x_2)$ \\
0 & 0 & $\theta_{400}$ & $1 - \theta_{400}$ \\
0 & 1 & $\theta_{401}$ & $1 - \theta_{401}$ \\
1 & 0 & $\theta_{410}$ & $1 - \theta_{410}$ \\
1 & 1 & $\theta_{411}$ & $1 - \theta_{411}$
\end{tabular}
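The counting estimate described above can be sketched in a few lines. The observations below are invented for illustration, with each row holding binary values for (x1, x2, x3, x4), and the helper name is hypothetical. Note that the helper returns the probability that the child equals 1, which is one minus the θ entries tabulated for x4 = 0.

```python
from collections import Counter

def estimate_conditional(observations, child, parents):
    """ML estimate of P(child = 1 | parents) obtained by counting.

    child and parents are column indices into each observation row.
    Returns a dict mapping each observed parent configuration to a proportion.
    """
    both = Counter()           # times the parents take a value AND child == 1
    parent_counts = Counter()  # times the parents take that value
    for row in observations:
        key = tuple(row[p] for p in parents)
        parent_counts[key] += 1
        if row[child] == 1:
            both[key] += 1
    return {key: both[key] / parent_counts[key] for key in parent_counts}

# Hypothetical fully observed data, one row per observation.
observations = [
    (1, 1, 0, 1),
    (1, 1, 1, 1),
    (1, 0, 1, 0),
    (0, 0, 0, 0),
    (0, 1, 0, 0),
    (1, 1, 1, 1),
]

p_x2_given_x1 = estimate_conditional(observations, child=1, parents=[0])
p_x4_given_x2_x3 = estimate_conditional(observations, child=3, parents=[1, 2])
print(p_x2_given_x1)
print(p_x4_given_x2_x3)
```

Parent configurations that never occur in the data do not appear in the result, since the proportion is undefined for them.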

For the exponential family of distributions there is a general formula for the ML estimates, but in general it does not have a closed-form solution. To get around this, one can use the Iteratively Reweighted Least Squares (IRLS) method, which is a form of the Newton-Raphson method, to find these parameters.

In the case of the undirected model things get a little more complicated. The $\theta_i$ do not decouple and so they cannot be calculated separately. To solve this we can use the KL divergence, which measures how far apart two distributions are.

## EM Algorithm

Let us once again consider the above example, only this time the data were not collected properly. Instead of having complete data about every random variable in every observation, some data points are missing.

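\begin{tabular}{ccccc}
Observation & $X_1$ & $X_2$ & $X_3$ & $X_4$ \\
1 & $x_{11}$ & $x_{12}$ & $Z_{13}$ & $x_{14}$ \\
2 & $x_{21}$ & $x_{22}$ & $x_{23}$ & $x_{24}$ \\
3 & $Z_{31}$ & $x_{32}$ & $x_{33}$ & $x_{34}$ \\
4 & $Z_{41}$ & $x_{42}$ & $x_{43}$ & $Z_{44}$ \\
& & ... & & \\
n & $x_{n1}$ & $x_{n2}$ & $x_{n3}$ & $x_{n4}$
\end{tabular}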

In the above table the x values represent data as before and the Z values represent missing data (sometimes called latent data) at that point. The question now is how to calculate the values of the parameters $\theta_i$ if we do not have all the data we need. We can use the Expectation Maximization (or EM) Algorithm to estimate the parameters of the model even though we do not have a complete data set.
One thing to note here is that with missing values the likelihood function may have multiple local maxima, and as a result the EM Algorithm does not always reach the global maximum. Instead it may find one of a number of local maxima. Multiple runs of the EM Algorithm with different starting values may produce different results, since each run may reach a different local maximum.
Define the following types of likelihoods:
complete log likelihood = $l_c(\theta;x,z) = log(P(x,z | \theta))$.
incomplete log likelihood = $l(\theta;x) = log(P(x | \theta))$.

### Derivation of EM

We can rewrite the incomplete likelihood in terms of the complete likelihood. This equation is in fact the discrete case but to convert to the continuous case all we have to do is turn the summation into an integral.

$l(\theta;x) = log(P(x | \theta)) = log(\sum_z P(x,z | \theta))$

Since z has not been observed, $l_c$ is in fact a random quantity. In that case we can define the expectation of $l_c$ in terms of some arbitrary density function q(z | x).

$E_q[l_c(\theta; x, z)] = \sum_z q(z|x)log(P(x, z|\theta))$

#### Jensen's Inequality

In order to properly derive the formula for the EM algorithm we need to first introduce the following theorem.

For any convex function f and any $\alpha \in [0,1]$:

$f(\alpha x_1 + (1-\alpha)x_2) \leqslant \alpha f(x_1) + (1-\alpha)f(x_2)$

This can be shown intuitively through a graph. In the figure below, point A is the value of the function f at $\alpha x_1 + (1-\alpha)x_2$ and point B is the value given by the right side of the inequality. On the graph one can see why point A lies below point B for a convex function.

Fig.XX Jensen's Inequality

For us it is important that the log function is concave, so the inequality is reversed: $log(\alpha x_1 + (1-\alpha)x_2) \geqslant \alpha log(x_1) + (1-\alpha)log(x_2)$. Jensen's inequality, in this concave form, is used in the third line of the EM derivation.
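A quick numeric check of the concave form can make the inequality concrete. The sketch below uses arbitrary weights and positive values chosen for illustration, and the function name is hypothetical; it returns the gap between the log of the weighted average and the weighted average of the logs, which should never be negative.

```python
import math

def jensen_gap_log(weights, values):
    """log(sum_k w_k v_k) - sum_k w_k log(v_k) for weights that sum to one."""
    log_of_average = math.log(sum(w * v for w, v in zip(weights, values)))
    average_of_logs = sum(w * math.log(v) for w, v in zip(weights, values))
    return log_of_average - average_of_logs

# Two-point case matching the statement above with alpha = 0.3.
gap_two_points = jensen_gap_log([0.3, 0.7], [1.0, 4.0])

# When all values coincide the inequality holds with equality.
gap_equal_values = jensen_gap_log([0.3, 0.7], [2.0, 2.0])

print(gap_two_points, gap_equal_values)
```

The equality case is the one exploited by EM: the lower bound touches the log likelihood when the ratio inside the log does not depend on z.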

#### Derivation

$\begin{matrix} l(\theta, x) & = & log(\sum_z P(x,z|\theta)) \\ & = & log(\sum_z q(z|x) \frac{P(x,z|\theta)}{q(z|x)}) \\ & \geqslant & \sum_z q(z|x)log(\frac{P(x,z|\theta)}{q(z|x)}) \\ & = & \mathfrak{L}(q;\theta) \end{matrix}$

The function $\mathfrak{L}(q;\theta)$ is called the auxiliary function and it is used in the EM algorithm. The EM algorithm has two steps that we repeat one after the other in order to get better estimates for q(z | x) and θ. As the steps are repeated the parameters converge to a local maximum of the likelihood function.

E-Step

$argmax_{q} \mathfrak{L}(q;\theta^{(t)}) = q^{(t+1)}$

M-Step

$argmax_{\theta} \mathfrak{L}(q^{(t+1)};\theta) = \theta^{(t+1)}$

$\begin{matrix} \mathfrak{L}(q;\theta) & = & \sum_z q(z|x) log(\frac{P(x,z|\theta)}{q(z|x)}) \\ & = & \sum_z q(z|x)log(P(x,z|\theta)) - \underbrace{\sum_z q(z|x)log(q(z|x))}_{\mbox{Constant with respect to } \theta} \\ & = & E_q[ l_c(\theta;x, z) ] + \mbox{constant} \end{matrix}$

Since the second part of the equation is only a constant with respect to θ, in the M-step we only need to maximise the expectation of the complete likelihood. The complete likelihood is the only part that still depends on θ.

In the E-step we are trying to find an estimate for q(z | x). To do this we have to maximise $\mathfrak{L}(q;\theta^{(t)})$ with respect to q.

$\mathfrak{L}(q;\theta^{(t)}) = \sum_z q(z|x) log(\frac{P(x,z|\theta^{(t)})}{q(z|x)})$

It can be shown that the maximum is attained at $q(z | x) = P(z | x,\theta^{(t)})$. So, replace q(z | x) with $P(z | x,\theta^{(t)})$.

$\begin{matrix} \mathfrak{L}(q;\theta^{(t)}) & = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(x,z|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\ & = & \sum_z P(z|x,\theta^{(t)}) log(\frac{P(z|x,\theta^{(t)})P(x|\theta^{(t)})}{P(z|x,\theta^{(t)})}) \\ & = & \sum_z P(z|x,\theta^{(t)}) log(P(x|\theta^{(t)})) \\ & = & log(P(x|\theta^{(t)})) \\ & = & l(\theta^{(t)}; x) \end{matrix}$

But $\mathfrak{L}(q;\theta^{(t)})$ is a lower bound on $l(\theta^{(t)};x)$, and this choice of q attains the bound, so $P(z | x,\theta^{(t)})$ is in fact the maximiser of $\mathfrak{L}$ over q. We can therefore see that the E-Step requires no optimisation: at each iteration we simply set q to the posterior of z under the current parameters $\theta^{(t)}$ and use it in the M-Step.

From the above results we can find that we have an alternative representation for the EM algorithm. We can reduce it to:

E-Step
Compute $E[l_c(\theta;x,z)]$ under $P(z | x,\theta^{(t)})$.
M-Step
Maximise $E[l_c(\theta;x,z)]$ with respect to θ to obtain $\theta^{(t+1)}$.

The EM Algorithm is probably best understood through examples.

#### EM Algorithm Example

Suppose we have two independent and identically distributed random variables:

$Y_1, Y_2 \sim P(y | \theta) = \theta e^{-\theta y}$

In our case $y_1 = 5$ has been observed but $y_2$ has not. Our task is to find an estimate for θ. We will try to solve the problem first without the EM algorithm. Luckily this problem is simple enough to be solvable without the need for EM.

$\begin{matrix} L(\theta; Data) & = & \theta e^{-5\theta} \\ l(\theta; Data) & = & log(\theta)- 5\theta \end{matrix}$

We take our derivative:

$\begin{matrix} & \frac{dl}{d\theta} & = 0 \\ \Rightarrow & \frac{1}{\theta}-5 & = 0 \\ \Rightarrow & \theta & = 0.2 \end{matrix}$

And now we can try the same problem with the EM Algorithm. Treating $y_2$ as if it had been observed, the complete likelihood and complete log likelihood are:

$\begin{matrix} L_c(\theta; Data) & = & \theta e^{-5\theta}\theta e^{-y_2\theta} \\ l_c(\theta; Data) & = & 2log(\theta) - 5\theta - y_2\theta \end{matrix}$

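E-Step

The expectation of $y_2$ under the exponential distribution with the current parameter is $1/\theta^{(t)}$, so:

$E[l_c(\theta; Data)]_{P(y_2|y_1, \theta^{(t)})} = 2log(\theta) - 5\theta - \frac{\theta}{\theta^{(t)}}$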

M-Step

$\begin{matrix} & \frac{dl_c}{d\theta} & = 0 \\ \Rightarrow & \frac{2}{\theta}-5 - \frac{1}{\theta^{(t)}} & = 0 \\ \Rightarrow & \theta^{(t+1)} & = \frac{2\theta^{(t)}}{5\theta^{(t)}+1} \end{matrix}$

Now we pick an initial value for θ. Usually we want to pick something reasonable. In this case it does not matter that much and we can pick θ = 10. Now we repeat the update above, which combines the E-Step and the M-Step, until the value converges.

$\begin{matrix} \theta^{(1)} & = & 10 \\ \theta^{(2)} & = & 0.392 \\ \theta^{(3)} & = & 0.2649 \\ ... & & \\ \theta^{(k)} & \simeq & 0.2 \end{matrix}$

And as we can see after a number of steps the value converges to the correct answer of 0.2. In the next section we will discuss a more complex model where it would be difficult to solve the problem without the EM Algorithm.
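The iteration is short enough to run directly. This is a minimal sketch of the update for this particular example, with a hypothetical function name and an arbitrary number of iterations; it is not a general EM implementation.

```python
def em_update(theta, y1=5.0):
    """One EM iteration for two iid exponential variables with y2 missing."""
    # E-step: the expected value of the missing y2 under the current theta.
    expected_y2 = 1.0 / theta
    # M-step: maximise 2 log(theta) - theta * (y1 + E[y2]) over theta.
    return 2.0 / (y1 + expected_y2)

theta = 10.0          # starting value used in the text
history = [theta]
for _ in range(50):
    theta = em_update(theta)
    history.append(theta)

print(history[:4])
print(theta)
```

The first few entries of `history` reproduce the sequence above, and the final value agrees with the direct ML answer of 0.2.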

### Mixture Models

In this section we discuss what will happen if the random variables are not identically distributed. The data will now sometimes be sampled from one distribution and sometimes from another.

#### Mixture of Gaussian

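Given $P(x | \theta) = \alpha N(x,\mu_1,\sigma_1) + (1 - \alpha)N(x,\mu_2,\sigma_2)$. We sample the data, $Data = \lbrace x_1,x_2,...,x_n \rbrace$, and we know that $x_1,x_2,...,x_n$ are iid from P(x | θ).
We would like to find:

$\theta = \lbrace \alpha,\mu_1,\sigma_1,\mu_2,\sigma_2 \rbrace$

We have no missing data here so we can try to find the parameter estimates using the ML method.

$L(\theta;Data) = \prod_{i=1}^{n}(\alpha N(x_i,\mu_1,\sigma_1) + (1 - \alpha)N(x_i,\mu_2,\sigma_2))$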

And then we need to take the log to find l(θ,Data) and then we take the derivative for each parameter and then we set that derivative equal to zero. That sounds like a lot of work because the Gaussian is not a nice distribution to work with and we do have 5 parameters.
It is actually easier to apply the EM algorithm. The only thing is that the EM algorithm works with missing data and here we have all of our data. The solution is to introduce a latent variable z. We are basically introducing missing data to make the calculation easier to compute.

$z_i = 1$ with probability α
$z_i = 0$ with probability (1 − α)

Now we have a data set that includes our latent variable $z_i$:

$Data = \lbrace (x_1,z_1),(x_2,z_2),...,(x_n,z_n) \rbrace$

We can calculate the joint pdf by:

$P(x_i,z_i | \theta) = P(x_i | z_i,\theta)P(z_i | \theta)$

Let,

$P(x_i|z_i,\theta)= \left\{ \begin{matrix} \phi_1(x_i) = N(x_i,\mu_1,\sigma_1) & \mbox{if } z_i = 1 \\ \phi_2(x_i) = N(x_i,\mu_2,\sigma_2) & \mbox{if } z_i = 0 \end{matrix} \right.$

Now we can write

$P(x_i|z_i,\theta)=\phi_1(x_i)^{z_i} \phi_2(x_i)^{1-z_i}$

and

$P(z_i)=\alpha^{z_i}(1-\alpha)^{1-z_i}$

We can write the joint pdf as:

$P(x_i,z_i|\theta)=\phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i}$

From the joint pdf we can get the likelihood function as:

$L(\theta;D)=\prod_{i=1}^n \phi_1(x_i)^{z_i}\phi_2(x_i)^{1-z_i}\alpha^{z_i}(1-\alpha)^{1-z_i}$

Then take the log and find the log likelihood:

$l_c(\theta;D)=\sum_{i=1}^n \left[ z_i log\phi_1(x_i) + (1-z_i)log\phi_2(x_i) + z_ilog\alpha + (1-z_i)log(1-\alpha) \right]$

In the E-step we need to find the expectation of lc

$E[l_c(\theta;D)] = \sum_{i=1}^n \left[ E[z_i]log\phi_1(x_i)+(1-E[z_i])log\phi_2(x_i)+E[z_i]log\alpha+(1-E[z_i])log(1-\alpha) \right]$

For now we can assume that $\langle z_i \rangle = E[z_i]$ is known and assign it a value: let $\langle z_i \rangle = w_i$.
In the M-step, we update our parameters while holding this expectation fixed:

$\theta^{(t+1)} \leftarrow argmax_{\theta}E[l_c(\theta;D)]$

Taking partial derivatives of the expected complete log likelihood with respect to the parameters and setting them equal to zero, we get our estimated parameters at (t+1).

$\begin{matrix} \frac{d}{d\alpha} = 0 \Rightarrow & \sum_{i=1}^n \left(\frac{w_i}{\alpha}-\frac{1-w_i}{1-\alpha}\right) = 0 & \Rightarrow \alpha=\frac{\sum_{i=1}^n w_i}{n} \\ \frac{d}{d\mu_1} = 0 \Rightarrow & \sum_{i=1}^n w_i(x_i-\mu_1)=0 & \Rightarrow \mu_1=\frac{\sum_{i=1}^n w_ix_i}{\sum_{i=1}^n w_i} \\ \frac{d}{d\mu_2}=0 \Rightarrow & \sum_{i=1}^n (1-w_i)(x_i-\mu_2)=0 & \Rightarrow \mu_2=\frac{\sum_{i=1}^n (1-w_i)x_i}{\sum_{i=1}^n (1-w_i)} \\ \frac{d}{d\sigma_1^2} = 0 \Rightarrow & \sum_{i=1}^n w_i(-\frac{1}{2\sigma_1^{2}}+\frac{(x_i-\mu_1)^2}{2\sigma_1^4})=0 & \Rightarrow \sigma_1^2=\frac{\sum_{i=1}^n w_i(x_i-\mu_1)^2}{\sum_{i=1}^n w_i} \\ \frac{d}{d\sigma_2^2} = 0 \Rightarrow & \sum_{i=1}^n (1-w_i)(-\frac{1}{2\sigma_2^{2}}+\frac{(x_i-\mu_2)^2}{2\sigma_2^4})=0 & \Rightarrow \sigma_2^2=\frac{\sum_{i=1}^n (1-w_i)(x_i-\mu_2)^2}{\sum_{i=1}^n (1-w_i)} \end{matrix}$

We can verify that the results of the estimated parameters all make sense by considering what we know about the ML estimates from the standard Gaussian. But we are not done yet. We still need to compute $\langle z_i \rangle = w_i$ in the E-step.

$\begin{matrix} w_i = \langle z_i \rangle & = & E_{z_i|x_i,\theta^{(t)}}(z_i) \\ & = & \sum_z z_i P(z_i|x_i,\theta^{(t)}) \\ & = & 1\times P(z_i=1|x_i,\theta^{(t)}) + 0\times P(z_i=0|x_i,\theta^{(t)}) \\ & = & P(z_i=1|x_i,\theta^{(t)}) \\ P(z_i=1|x_i,\theta^{(t)}) & = & \frac{P(z_i=1,x_i|\theta^{(t)})}{P(x_i|\theta^{(t)})} \\ & = & \frac {P(z_i=1,x_i|\theta^{(t)})}{P(z_i=1,x_i|\theta^{(t)}) + P(z_i=0,x_i|\theta^{(t)})} \\ & = & \frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})} \end{matrix}$

We can now combine the two steps and we get the expectation

$E[z_i] =\frac{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) }{\alpha^{(t)}N(x_i,\mu_1^{(t)},\sigma_1^{(t)}) +(1-\alpha^{(t)})N(x_i,\mu_2^{(t)},\sigma_2^{(t)})}$

Using the above results for the estimated parameters in the M-step we can evaluate the parameters at (t+2),(t+3)...until they converge and we get our estimated value for each of the parameters.
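Putting the E-step weights and the M-step updates together gives a short loop. The following is a sketch under stated assumptions rather than a robust implementation: the data are simulated from a mixture with made-up parameters, the starting values are hand-picked near the truth, the number of iterations is fixed instead of testing for convergence, and there is no protection against a component collapsing onto a single point.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2.0 * math.pi) * sigma)

def em_two_gaussians(xs, alpha, mu1, sigma1, mu2, sigma2, iterations=50):
    """EM for a two-component univariate Gaussian mixture."""
    n = len(xs)
    for _ in range(iterations):
        # E-step: w_i = P(z_i = 1 | x_i, current parameters).
        w = []
        for x in xs:
            a = alpha * normal_pdf(x, mu1, sigma1)
            b = (1.0 - alpha) * normal_pdf(x, mu2, sigma2)
            w.append(a / (a + b))
        # M-step: weighted versions of the usual Gaussian ML estimates.
        sw = sum(w)
        alpha = sw / n
        mu1 = sum(wi * x for wi, x in zip(w, xs)) / sw
        mu2 = sum((1.0 - wi) * x for wi, x in zip(w, xs)) / (n - sw)
        var1 = sum(wi * (x - mu1) ** 2 for wi, x in zip(w, xs)) / sw
        var2 = sum((1.0 - wi) * (x - mu2) ** 2 for wi, x in zip(w, xs)) / (n - sw)
        sigma1, sigma2 = math.sqrt(var1), math.sqrt(var2)
    return alpha, mu1, sigma1, mu2, sigma2

# Simulated data: component 1 is N(0, 1) with weight 0.4, component 2 is N(5, 1).
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) if rng.random() < 0.4 else rng.gauss(5.0, 1.0)
      for _ in range(2000)]

alpha_hat, mu1_hat, sigma1_hat, mu2_hat, sigma2_hat = em_two_gaussians(
    xs, alpha=0.5, mu1=-1.0, sigma1=1.0, mu2=6.0, sigma2=1.0)
print(alpha_hat, mu1_hat, sigma1_hat, mu2_hat, sigma2_hat)
```

Because the likelihood can have several local maxima, different starting values may lead to different answers, so in practice one would try more than one initialisation.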

The mixture model can be summarized as:

• In each step, a state will be selected according to p(z).
• Given a state, a data vector is drawn from p(x | z).
• The value of each state is independent from the previous state.

A good example of a mixture model can be seen in this example with two coins. Assume that there are two different coins that are not fair. Suppose that the probabilities for each coin are as shown in the table.
\begin{tabular}{|c|c|c|}
\hline
& H & T \\
\hline
coin1 & 0.3 & 0.7 \\
coin2 & 0.1 & 0.9 \\
\hline
\end{tabular}
We can choose one coin at random and toss it in the air to see the outcome. Then we place the coin back in the pocket with the other one and once again select one coin at random to toss. The resulting sequence of outcomes, HHTH...HTTHT, is a sample from a mixture model. In this model the probability of each outcome depends on which coin was used to make the toss and on the probability with which we select each coin. For example, if we were to select coin1 most of the time then we would see more heads than if we were to choose coin2 most of the time.
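The two-coin experiment is easy to simulate. In the sketch below the head probabilities come from the table, while the probability of picking coin1 (0.5 here), the sequence length and the seed are assumptions made for illustration.

```python
import random

# Probability of heads for each coin, taken from the table.
P_HEADS = {"coin1": 0.3, "coin2": 0.1}

def toss_sequence(n, p_coin1, rng):
    """Simulate n tosses, choosing a coin afresh before every toss."""
    outcomes = []
    for _ in range(n):
        coin = "coin1" if rng.random() < p_coin1 else "coin2"
        outcomes.append("H" if rng.random() < P_HEADS[coin] else "T")
    return "".join(outcomes)

rng = random.Random(0)
seq = toss_sequence(100000, p_coin1=0.5, rng=rng)

freq_heads = seq.count("H") / len(seq)
# Mixture probability of heads: sum over coins of P(coin) * P(H | coin).
expected_heads = 0.5 * P_HEADS["coin1"] + 0.5 * P_HEADS["coin2"]
print(seq[:20], freq_heads, expected_heads)
```

The observed frequency of heads should be close to the mixture probability, and raising `p_coin1` should raise it, as described above.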

## Hidden Markov Models

In a Hidden Markov Model (HMM) we consider that we have two levels of random variables. The first level is called the hidden layer because the random variables in that level cannot be observed. The second layer is the observed or output layer. We can sample from the output layer but not the hidden layer. The only information we know about the hidden layer is that it affects the output layer. The HMM can be graphed as shown in the figure below.

Fig.XX Hidden Markov Model

In the model the $q_i$ are the hidden layer and the $y_i$ are the output layer. The $y_i$ are shaded because they have been observed. The parameters that need to be estimated are θ = (π,A,η). Here π is the initial state distribution: $\pi_i$ is the probability that $q_0$ is in state i. The matrix A is the transition matrix between $q_t$ and $q_{t+1}$ and gives the probability of changing states as we move from one step to the next. Finally, η is the parameter that determines the probability that $y_t$ takes the value $y^*$ given that $q_t$ is in state $q^*$.
For the HMM our data comes from the output layer:

$Data = (y_{0i},y_{1i},y_{2i},...,y_{Ti})$ for i = 1...n

We can now write the joint pdf as:

$P(q, y) = p(q_0)\prod_{t=0}^{T-1}P(q_{t+1}|q_t)\prod_{t=0}^{T}P(y_t|q_t)$

We can use $a_{ij}$ to represent the i,j entry in the matrix A. We can then define:

$P(q_{t+1}|q_t) = \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j}$

We can also define:

$p(q_0) = \prod_{i=1}^M (\pi_i)^{q_0^i}$

Now, if we take Y to be multinomial we get:

$P(y_t|q_t) = \prod_{i,j=1}^M (\eta_{ij})^{y_t^i q_t^j}$

The random variable Y does not have to be multinomial, this is just an example. We can combine the first two of these definitions back into the joint pdf to produce:

$P(q, y) = \prod_{i=1}^M (\pi_i)^{q_0^i}\prod_{t=0}^{T-1} \prod_{i,j=1}^M (a_{ij})^{q_t^i q_{t+1}^j} \prod_{t=0}^{T}P(y_t|q_t)$

We can go on to the E-Step with this new joint pdf. In the E-Step we need to find the expectation of the missing data given the observed data and the current values of the parameters. Suppose that we only sample once so n = 1. Take the log of our pdf and we get:

$l_c(\theta, q, y) = \sum_{i=1}^M {q_0^i}log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M {q_t^i q_{t+1}^j} log(a_{ij}) + \sum_{t=0}^{T}log(P(y_t|q_t))$

Then we take the expectation for the E-Step:

$E[l_c(\theta, q, y)] = \sum_{i=1}^M E[q_0^i]log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M E[q_t^i q_{t+1}^j] log(a_{ij}) + \sum_{t=0}^{T}E[log(P(y_t|q_t))]$

If we continue with our multinomial example then we would get:

$\sum_{t=0}^{T}E[log(P(y_t|q_t))] = \sum_{t=0}^{T}\sum_{i,j=1}^M E[q_t^j] y_t^i log(\eta_{ij})$

So now we need to calculate $E[q_0^i]$ and $E[q_t^i q_{t+1}^j]$ in order to find the expectation of the log likelihood. Let's define some variables to represent each of these quantities.
Let $\gamma_0^i = E[q_0^i] = P(q_0^i=1|y, \theta^{(t)})$.
Let $\xi_{t,t+1}^{ij} = E[q_t^i q_{t+1}^j] = P(q_t^i = 1, q_{t+1}^j = 1|y, \theta^{(t)})$.
We could use the sum product algorithm to calculate these equations but in this case we will introduce a new algorithm that is called the α - β Algorithm.

### The α - β Algorithm

We have from before the expectation:

$E[l_c(\theta, q, y)] = \sum_{i=1}^M \gamma_0^i log(\pi_i) + \sum_{t=0}^{T-1} \sum_{i,j=1}^M \xi_{t,t+1}^{ij} log(a_{ij}) + \sum_{t=0}^{T}E[log(P(y_t|q_t))]$

As usual we take the derivative with respect to θ, set it equal to zero and solve, subject to the constraint that each set of probabilities sums to one. We obtain the following results (you can check these). Note that for η we are using a specific $y^*$ that is given.

$\begin{matrix} \hat \pi_i & = & \frac{\gamma_0^i}{\sum_{k=1}^M \gamma_0^k} \\ \hat a_{ij} & = & \frac{\sum_{t=0}^{T-1}\xi_{t,t+1}^{ij}}{\sum_{k=1}^M\sum_{t=0}^{T-1}\xi_{t,t+1}^{ik}} \\ \hat \eta_i(y^*) & = & \frac{\sum_{t|y_t=y^*}\gamma_t^i}{\sum_{t=0}^T\gamma_t^i} \end{matrix}$

For η we can think of this intuitively. It represents the proportion of times that state i produces $y^*$. For example we can think of the multinomial case for y where:

$\hat \eta_{ij} = \frac{\sum_{t=0}^T\gamma_t^i y_t^j}{\sum_{t=0}^T\gamma_t^i}$

Notice here that all of these parameters have been solved in terms of $\gamma_t^i$ and $\xi_{t,t+1}^{ij}$. If we are able to calculate those two quantities then we can calculate everything in this model. This is where the α - β Algorithm comes in.

$\begin{matrix} \gamma_t^i & = & P(q_t^i = 1|y) \\ & = & \frac{P(y|q_t)P(q_t)}{P(y)} \end{matrix}$

Now, due to the Markov property, the observations up to time t and the observations after time t are conditionally independent given $q_t$:

$\begin{matrix} \gamma_t^i & = & \frac{P(y_0...y_t|q_t)P(y_{t+1}...y_T|q_t)P(q_t)}{P(y)} \\ & = & \frac{P(y_0...y_t|q_t)P(q_t)P(y_{t+1}...y_T|q_t)}{P(y)} \\ & = & \frac{P(y_0...y_t, q_t)P(y_{t+1}...y_T|q_t)}{P(y)} \end{matrix}$

Define α and β as follows:

$\alpha(q_t) = P(y_0...y_t,q_t)$
$\beta(q_t) = P(y_{t+1}...y_T | q_t)$

Once we have α and β then computing P(y) is easy.

$P(y) = \sum_{q_t}\alpha(q_t)\beta(q_t)$

To calculate α and β themselves we can use:
For α:

$\alpha(q_{t+1}) = \sum_{q_t}\alpha(q_t)a_{q_t,q_{t+1}}P(y_{t+1}|q_{t+1})$

Where we begin with:

$\alpha(q_0) = P(y_0,q_0) = P(y_0 | q_0)\pi_{q_0}$

Then for β:

$\beta(q_t) = \sum_{q_{t+1}}\beta(q_{t+1})a_{q_t,q_{t+1}}P(y_{t+1}|q_{t+1})$

Where we now begin from the other end:

$\beta(q_T) = (1,1,...,1)$, a vector of ones

Once both α and β have been calculated we can use them to find:

$\gamma_t^i = \frac{\alpha(q_t)\beta(q_t)}{\sum_{q_t}\alpha(q_t)\beta(q_t)}$
$\xi_{t,t+1}^{ij} = \frac{\alpha(q_t)a_{q_t,q_{t+1}}P(y_{t+1}| q_{t+1}) \beta(q_{t+1})}{P(y)}$
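The α and β recursions translate almost line by line into code. The sketch below is an unscaled version for a discrete-output HMM: the argument names (`pi`, `A`, `B`, `obs`) and the toy numbers are assumptions for this illustration, with `B[i][k]` playing the role of η as the probability that state i emits symbol k. Without rescaling, α and β underflow on long sequences, so this form is only suitable for short ones.

```python
def forward_backward(pi, A, B, obs):
    """Unscaled alpha-beta recursions for a discrete-output HMM.

    pi[i]   = P(q_0 = i)
    A[i][j] = P(q_{t+1} = j | q_t = i)
    B[i][k] = P(y_t = k | q_t = i)
    obs     = list of observed symbols, encoded as integers
    """
    M, n_obs = len(pi), len(obs)
    alpha = [[0.0] * M for _ in range(n_obs)]
    beta = [[1.0] * M for _ in range(n_obs)]   # last row stays a vector of ones

    # Forward pass: alpha_0(i) = pi_i P(y_0 | i), then recurse forwards.
    for i in range(M):
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, n_obs):
        for j in range(M):
            alpha[t][j] = B[j][obs[t]] * sum(alpha[t - 1][i] * A[i][j] for i in range(M))

    # Backward pass: start from the last time step and recurse backwards.
    for t in range(n_obs - 2, -1, -1):
        for i in range(M):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(M))

    p_y = sum(alpha[n_obs - 1])
    gamma = [[alpha[t][i] * beta[t][i] / p_y for i in range(M)] for t in range(n_obs)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_y
            for j in range(M)] for i in range(M)] for t in range(n_obs - 1)]
    return alpha, beta, gamma, xi, p_y

# Toy two-state, two-symbol model with made-up numbers.
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]

alpha, beta, gamma, xi, p_y = forward_backward(pi, A, B, obs)
print(p_y)
print(gamma)
```

Two useful sanity checks are that each row of `gamma` sums to one and that summing `xi[t][i]` over the next state recovers `gamma[t][i]`.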

## Sampling Methods

A fundamental problem in statistics has always been to find the expectation of f(x) with respect to P(x).

$E[f] = \int f(x)P(x) dx$

In many cases this integral is quite difficult to compute directly and so certain methods have been developed in an attempt to estimate the value without the need to actually do the integration. One such method is the Monte Carlo method where the integral is estimated by a sum.

$\hat f = \frac{1}{n}\sum_{i=1}^n f(x_i) \text{ where } x_i \sim P(x)$

We can also find the mean and standard deviation of the estimate. In fact, the estimate is unbiased: its mean is the correct expectation of f(x).

$E[\hat f] = E[f]$
$\sigma_{\hat f} = \frac{\sigma}{\sqrt{n}}$
$\sigma^2 = E[(f - E[f])^2]$

So the only setback is that we have to be able to sample from P(x).
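The Monte Carlo estimate and its standard error can be sketched as follows. The choice of f(x) = x² with x drawn from UNIF(0,1) is an arbitrary example with a known answer of 1/3, and the helper name is hypothetical.

```python
import math
import random

def monte_carlo_expectation(f, draw, n):
    """Estimate E[f(x)] from n draws, with an estimate of the standard error."""
    values = [f(draw()) for _ in range(n)]
    mean = sum(values) / n
    # Sample variance of f, used in place of the unknown sigma^2.
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)

rng = random.Random(0)
estimate, std_error = monte_carlo_expectation(lambda x: x * x, rng.random, 100000)
print(estimate, std_error)
```

Quadrupling n should roughly halve the standard error, in line with the $\sigma/\sqrt{n}$ formula.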

### Sampling from Uniform

Let us assume that we want to sample from UNIF(0, 1). How would we go about doing this? Sampling from a uniform distribution that is truly random is very difficult, so we are only going to look at the way it is done on a computer. On a computer we have a function that looks something like $D(x) = (ax + b)\ mod\ m$ for some constants a, b and m. The choice of a, b and m is very important for the simulation of random numbers to work. The computer is also provided with a seed, which becomes the first term of the sequence, $x_0 = seed$. The seed is often taken from the CPU clock. After that every 'random' number is generated by $x_{i+1} = D(x_i)$, and dividing $x_i$ by m gives a value in [0, 1). If one were to know the seed and the constants a, b and m then the series of 'random' numbers could be predicted exactly. That is why we call random numbers that are generated by a computer Pseudo Random Numbers.
For the rest of this section we will assume that we know how to draw from a uniform distribution. It will provide us with the 'randomness' that is needed by each of our algorithms.
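A generator of this form takes only a few lines. In the sketch below the constants a, b and m are one commonly quoted choice, often attributed to Numerical Recipes; they are used purely for illustration and are not taken from the text, and a generator like this is not suitable for serious work.

```python
def lcg(seed, a=1664525, b=1013904223, m=2 ** 32):
    """Linear congruential generator yielding values in [0, 1)."""
    x = seed
    while True:
        x = (a * x + b) % m   # x_{i+1} = D(x_i)
        yield x / m           # rescale the integer state to [0, 1)

gen_a = lcg(seed=42)
gen_b = lcg(seed=42)
first_a = [next(gen_a) for _ in range(5)]
first_b = [next(gen_b) for _ in range(5)]

# Same seed and constants give exactly the same 'random' sequence.
print(first_a)
print(first_a == first_b)
```

The repeated sequence is the predictability described above: anyone who knows the seed and the constants can reproduce every value.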

### Inverse Method for Sampling

This is a two step method:
Step 1: Draw $u \sim UNIF(0,1)$.
Step 2: Compute $x = F^{-1}(u)$ where $F(x) = \int^x_{-\infty} {P(t)dt}$ is the cumulative distribution function.
Example:
Suppose that we want to draw a sample from $P(x) = \theta e^{-\theta x}$ where x > 0. We need to first find F(x) and then $F^{-1}$.

$F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x}$
$F^{-1}(u) = \frac{-log(1-u)}{\theta}$

Now we can generate our random sample i = 1...n from P(x) by:

$1)\ u_i \sim UNIF(0,1)$
$2)\ x_i = \frac{-log(1-u_i)}{\theta}$

The $x_i$ are now a random sample from P(x).
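The two steps of this example translate directly into code. This is a minimal sketch; the function name `sample_exponential` and the value $\theta = 2$ are illustrative assumptions.

```python
import math
import random

def sample_exponential(theta, n, rng):
    """Draw n values from P(x) = theta * exp(-theta * x) by inverting the CDF."""
    samples = []
    for _ in range(n):
        u = rng.random()                            # step 1: u ~ UNIF(0, 1)
        samples.append(-math.log(1.0 - u) / theta)  # step 2: x = F^{-1}(u)
    return samples

rng = random.Random(1)
xs = sample_exponential(theta=2.0, n=50_000, rng=rng)  # theta = 2 is an assumption
sample_mean = sum(xs) / len(xs)
print(sample_mean)  # should be close to 1 / theta
```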
The major problem with this approach is that we have to find $F^{-1}$, and for many distributions, such as the Gaussian, F(x) has no closed-form inverse.

$F(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}e^{\frac{-u^2}{2}} du$

Here $F^{-1}(x)$ cannot be written in closed form.

### Box-Muller

This is a method for sampling from a Gaussian distribution. It is a specialized method that only works for this particular distribution. The steps below describe the polar form of the method.

1. Draw $x_1$ and $x_2$ from UNIF(-1, 1).
2. Accept the above values only if $r^2 = x_1^2+x_2^2 \leq 1$. Otherwise repeat the above step until this condition is met.
3. Calculate $y_1$ and $y_2$:
$y_1 = x_1 \left(\frac{-2log(r^2)}{r^2}\right)^{0.5}$
$y_2 = x_2 \left(\frac{-2log(r^2)}{r^2}\right)^{0.5}$
4. $y_1$ and $y_2$ are now independent and distributed N(0,1).
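A minimal sketch of the polar form of this method follows. It assumes the two uniform draws are taken from UNIF(-1, 1) and that the scaling factor is $\sqrt{-2\log(r^2)/r^2}$ with $r^2 = x_1^2 + x_2^2$, which is the standard statement of the polar method; the function name `polar_normal_pair` is hypothetical.

```python
import math
import random

def polar_normal_pair(rng):
    """Return two independent N(0, 1) draws using the polar method."""
    while True:
        # Assumption: draws come from UNIF(-1, 1) so both signs can occur.
        x1 = rng.uniform(-1.0, 1.0)
        x2 = rng.uniform(-1.0, 1.0)
        r2 = x1 * x1 + x2 * x2
        if 0.0 < r2 <= 1.0:  # keep only points inside the unit circle
            break
    scale = math.sqrt(-2.0 * math.log(r2) / r2)
    return x1 * scale, x2 * scale

rng = random.Random(2)
ys = []
for _ in range(20_000):
    ys.extend(polar_normal_pair(rng))

mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / (len(ys) - 1)
print(mean_y, var_y)  # should be close to 0 and 1
```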

### Rejection Sampling

Suppose that we want to sample from P(x), we are not in the Gaussian case, and we cannot find $F^{-1}$. Suppose also that there exists a q(x) that is easy to sample from. For instance the UNIF(0,1) is easy to sample from. Then if there exists a k such that $kq(x)\geq p(x)$ for all x then we can use rejection sampling.

Fig.XX Rejection Sampling Example

To present the problem intuitively we can observe the graph (Fig. XX), where the top line represents kq(x) and the bottom line represents p(x). We have in our example two points $x_1$ and $x_2$. Consider first $x_1$. From the graph we can tell that values around $x_1$ are far more likely under kq(x) than under p(x), and since we are sampling from q(x) we expect to see many more samples in this region than we actually need. We therefore must reject most of the values drawn from around $x_1$ and keep only a few. If we now look at $x_2$, the number of samples drawn from that region and the number we need are much closer, so we only have to reject a few of the values sampled from that area. So the question is: when we get an $x_i$ from q(x), how do we know if we should keep the value or throw it away? In regions where $kq(x_i)$ is far from $p(x_i)$ we must reject many more values than in regions where $kq(x_i)$ is close to $p(x_i)$. This is how rejection sampling works.

1. Draw xi from q(x).
2. Accept xi with probability $\frac{p(x_i)}{kq(x_i)}$ and reject the value otherwise.
3. The accepted values are now a random sample from your P(x).

Proof:
What we need to show is that $P(x_i | accept) = P(x_i)$.

$P(x_i|accept) = \frac{P(accept|x_i)q(x_i)}{P(accept)}$

We know from the definition of the algorithm that $P(accept|x_i) = \frac{p(x_i)}{kq(x_i)}$.

$P(accept) = \int_x P(accept|x)q(x)dx = \int_x \frac{p(x)}{kq(x)}q(x)dx = \frac{1}{k}\int_x p(x)dx = \frac{1}{k}$
$P(x_i|accept) = \frac{\frac{p(x_i)}{kq(x_i)}q(x_i)}{\frac{1}{k}} = P(x_i)$

We have proven that rejection sampling works. But this type of sampling has some disadvantages too. For one thing we can look at the acceptance rate $P(accept) = \frac{1}{k}$. For a large k we are discarding many values and so this method is very inefficient. Also, there are distributions P(x) where it would be difficult to find a suitable q(x) or k that would allow us to sample from P(x).

Example of Rejection Sampling:
Suppose we want to sample from a BETA(2,1).

$BETA(2,1) = \frac{\Gamma(2+1)}{\Gamma(2)\Gamma(1)}x^1(1-x)^0 = 2x \text{ for } 0 \leq x \leq 1$

Now we must find a k and a q(x). We can use the UNIF(0,1) as our q(x) because it is easy to sample from. For the value of k we must find the maximum value of $\frac{P(x)}{q(x)}$. In this case:

$\max \frac{P(x)}{q(x)} = 2 \Rightarrow k \geq 2$

So we will choose our k = 2 for this example and now we can run the algorithm.

1. Draw xi from UNIF(0,1).
2. Accept xi with probability $\frac{2x_i}{2*1} = x_i$ and reject the value otherwise.
3. The accepted values are now a random sample from BETA(2,1).
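The BETA(2,1) example can be sketched as follows, using q(x) = UNIF(0,1) and k = 2 as in the text. The function name `sample_beta_2_1` and the returned acceptance rate are illustrative additions.

```python
import random

def sample_beta_2_1(n, rng):
    """Draw n values from BETA(2, 1), p(x) = 2x, with q = UNIF(0, 1) and k = 2."""
    accepted = []
    proposals = 0
    while len(accepted) < n:
        x = rng.random()        # draw x_i from q(x)
        proposals += 1
        if rng.random() < x:    # accept with probability p(x) / (k q(x)) = x
            accepted.append(x)
    return accepted, n / proposals

rng = random.Random(3)
beta_samples, acceptance_rate = sample_beta_2_1(20_000, rng)
beta_mean = sum(beta_samples) / len(beta_samples)
print(beta_mean)        # the mean of BETA(2, 1) is 2/3
print(acceptance_rate)  # should be close to 1 / k = 0.5
```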

### Importance Sampling

We return once again to our problem of finding the expectation of f(x).

$E[f] = \int f(x)P(x)dx$

which can be approximated by:

$\frac{1}{n}\sum_{i=1}^n f(x_i) \text{ where each } x_i \text{ is drawn from } P(x)$

We can try to rewrite the first equation so that we sample from q(x) and not P(x).

$E[f] = \int f(x) \frac{P(x)}{q(x)}q(x) dx$

which can be approximated by:

$\frac{1}{n}\sum_{i=1}^n f(x_i)\frac{P(x_i)}{q(x_i)} \text{ where each } x_i \text{ is drawn from } q(x)$

The algorithm is as follows:

1. Draw $x_i$ from q(x).
2. Find the weight for $x_i$, $w_i = \frac{P(x_i)}{q(x_i)}$.
3. The weighted values $w_i f(x_i)$ can now be averaged to estimate E[f].

The main disadvantage is that in many cases we can have the weight very close to zero and the sample itself will become almost useless. We need to have a P(x) and a q(x) that are very close for this algorithm to be more efficient. This technique does turn out to be unbiased but due to the problem of low weights the variance tends to be very high.
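The estimator of this section can be sketched as below. As an illustrative assumption we take P(x) = 2x on [0, 1] (the BETA(2,1) density), q(x) = UNIF(0,1) and f(x) = x, so that E[f] = 2/3; the function name `importance_estimate` is hypothetical.

```python
import random

def importance_estimate(f, p_pdf, q_pdf, q_sample, n, rng):
    """Estimate E_P[f] as the average of w_i * f(x_i) with x_i drawn from q."""
    total = 0.0
    for _ in range(n):
        x = q_sample(rng)          # draw x_i from q(x)
        w = p_pdf(x) / q_pdf(x)    # weight w_i = P(x_i) / q(x_i)
        total += w * f(x)
    return total / n

rng = random.Random(4)
is_estimate = importance_estimate(
    f=lambda x: x,
    p_pdf=lambda x: 2.0 * x,        # assumed target density on [0, 1]
    q_pdf=lambda x: 1.0,            # UNIF(0, 1) density
    q_sample=lambda r: r.random(),
    n=50_000,
    rng=rng,
)
print(is_estimate)  # should be close to 2/3
```

Here the weights are bounded by 2, so the variance is modest. When q(x) puts little mass where P(x) f(x) is large, a few samples carry almost all of the weight and the estimate becomes unstable.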

### Greedy Importance Sampling

This method, as the name indicates, is somewhat similar to the method in the previous section. The difference from the previous algorithm is that we need to find the maximum point in P(x). The algorithm works as follows:

1. Draw $x_{i1}$ from q(x).
2. Move from $x_{i1}$ towards the maximum point in P(x) and sample along the way. The new sample set $x_{i1},...,x_{ik}$ must have the property that $\sum_{j=1}^k w_{ij} = 1$ where $w_{ij}$ is the weight of the sample $x_{ij}$.
3. The set $w_{ij}x_{ij}$ can now be used to estimate E[f].

This method is more complex and more difficult to compute than regular Importance Sampling, but it is unbiased and has the advantage of a lower variance.

### Markov Chain Monte Carlo

This is best explained with an example. Say that we have a series of random variables that each have a boolean state. Between two states $s_i$ and $s_{i+1}$ we have a set of transition probabilities.

• If $s_i = 0$ then $s_{i+1} = 0$ with probability $\frac{2}{3}$.
• If $s_i = 0$ then $s_{i+1} = 1$ with probability $\frac{1}{3}$.
• If $s_i = 1$ then $s_{i+1} = 0$ with probability $\frac{1}{3}$.
• If $s_i = 1$ then $s_{i+1} = 1$ with probability $\frac{2}{3}$.

We can say that the initial value for s0 = 1. From that we can deduce that:

• $P(s_1=1) = \frac{2}{3}$ and $P(s_1=0) = \frac{1}{3}$
• $P(s_2=1) = \frac{5}{9}$ and $P(s_2=0) = \frac{4}{9}$
• $P(s_3=1) = \frac{14}{27}$ and $P(s_3=0) = \frac{13}{27}$
• ...
• $P(s_\infty=1) = \frac{1}{2}$ and $P(s_\infty=0) = \frac{1}{2}$

We can see that the probabilities converge to 0.5 each. This is called the equilibrium probability distribution for this particular Markov chain. If we have a P(x) we want to sample from but don't know how, there may be a way to make that P(x) the equilibrium probability for a Markov chain and then sample from the tail end of the chain to get our random samples.
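The probabilities listed above can be reproduced exactly by iterating the transition rule. This is a small sketch using exact fractions; the names `STAY`, `SWITCH` and `step` are illustrative.

```python
from fractions import Fraction

# Transition probabilities from the text: stay with 2/3, switch with 1/3.
STAY, SWITCH = Fraction(2, 3), Fraction(1, 3)

def step(p1):
    """Given P(s_i = 1) = p1, return P(s_{i+1} = 1)."""
    return p1 * STAY + (1 - p1) * SWITCH

p1 = Fraction(1)  # s_0 = 1 with certainty
history = []
for _ in range(20):
    p1 = step(p1)
    history.append(p1)

print(history[:3])         # exact values of P(s_1 = 1), P(s_2 = 1), P(s_3 = 1)
print(float(history[-1]))  # approaches the equilibrium value 0.5
```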

#### Metropolis Algorithm

We would like to sample from some P(x) and this time use the metropolis algorithm, which is a type of MCMC, to do it. In order for this algorithm to work we first need a number of things.

1. We need some starting value x. This value can come from anywhere.
2. We need to find a value y that comes from the function T(x,y).
3. We need the function T to be symmetric: T(x,y) = T(y,x).
4. We also need T(x,y) = P(y | x), that is, T(x,y) is the probability of proposing y when the current value is x.

Once we have all of these conditions we can run the algorithm to find our random sample.

1. Get a starting value x.
2. Find the y value from the function T(x,y).
3. Accept y with the probability $min(\frac{P(y)}{P(x)}, 1)$.
4. If the y is accepted it becomes the new x value. Otherwise the current x is kept and counted again.
5. After a large number of iterations the series will converge.
6. When the series has converged any new values can be treated as random samples from P(x).

The point at which the series converges is called the 'burn in point'. We must always burn in a series before we can use it to sample because we have to make sure that the series has converged. The number of values before the burn in point depends on the functions we are using since some converge faster than others.
We want to prove that the Metropolis Algorithm works. How do we know that P(x) is in fact the equilibrium distribution for this MC? We have a condition called the detailed balance condition that is sufficient but not necessary when we want to prove that P(x) is the equilibrium distribution.

Theorem 3 If P(x)A(x,y) = P(y)A(y,x) and A(x,y) is the transition matrix for the MC then P(x) is the equilibrium distribution. This is called the Detailed Balance Condition.

Proof of Sufficiency for Detailed Balance Condition:
Need to show:

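$\int_y P(y)A(y,x)dy = P(x)$

Using the detailed balance condition, and the fact that $\int_y A(x,y)dy = 1$:

$\int_y P(y)A(y,x)dy = \int_y P(x)A(x,y)dy = P(x) \int_y A(x,y)dy = P(x)$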

We need to show that Metropolis satisfies the detailed balance condition. We can define A(x,y) as follows:

$A(x, y) = T(x, y) min(\frac{P(y)}{P(x)}, 1)$

Then,

$\begin{matrix} P(x)A(x, y) & = & P(x) T(x, y) min(1 , \frac{P(y)}{P(x)}) \\ & = & min (P(x) T(x, y), P(y)T(x, y)) \\ & = & min (P(x) T(y, x), P(y)T(y, x)) \\ & = & P(y) T(y, x) min(\frac{P(x)}{P(y)}, 1) \\ & = & P(y) A(y, x) \end{matrix}$

Therefore the detailed balance condition holds for the Metropolis Algorithm and we can say that P(x) is the equilibrium distribution.

Example:
Suppose that we want to sample from a Poisson(λ).

$P(x) = \frac{\lambda^x}{x!}e^{-\lambda} \text{ for } x = 0,1,2,3, ...$

Now define T(x,y): $y = x + \epsilon$ where $P(\epsilon = -1) = 0.5$ and $P(\epsilon = 1) = 0.5$. This type of T is called a random walk. We can select any $x^{(0)}$ from the range of x as a starting value. Then we can calculate a y value based on our T function. We will accept the y value as our new $x^{(i)}$ with the probability $min(\frac{P(y)}{P(x)}, 1)$. Once we have gathered many values, say 10000, and the series has converged we can begin to sample from that point on in the series. That sample is now the random sample from a Poisson(λ).
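The Poisson example can be sketched as follows. The sketch assumes the acceptance probability $min(\frac{P(y)}{P(x)}, 1)$, treats P(y) as zero for negative y so that such proposals are always rejected, and repeats the current value whenever a proposal is rejected. The function names, λ = 4 and the burn-in length are illustrative assumptions.

```python
import math
import random

def poisson_pmf(x, lam):
    """P(x) for a Poisson(lam); zero outside the non-negative integers."""
    if x < 0:
        return 0.0
    return math.exp(-lam) * lam ** x / math.factorial(x)

def metropolis_poisson(lam, n_samples, burn_in, rng, x0=0):
    """Random-walk Metropolis chain whose equilibrium distribution is Poisson(lam)."""
    x = x0
    chain = []
    for i in range(burn_in + n_samples):
        y = x + rng.choice((-1, 1))  # symmetric proposal: y = x + epsilon
        ratio = poisson_pmf(y, lam) / poisson_pmf(x, lam)
        if rng.random() < min(1.0, ratio):
            x = y                    # accept: y becomes the new x
        if i >= burn_in:
            chain.append(x)          # on rejection the old x is recorded again
    return chain

# lam = 4 and the burn-in of 10000 steps are illustrative assumptions.
chain = metropolis_poisson(lam=4.0, n_samples=50_000, burn_in=10_000, rng=random.Random(5))
chain_mean = sum(chain) / len(chain)
print(chain_mean)  # the mean of a Poisson(4) is 4
```

Successive values of the chain are correlated, so 50000 chain values carry less information than 50000 independent draws.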

#### Metropolis Hastings

As the name suggests the Metropolis Hastings algorithm is related to the Metropolis algorithm. It is a more generalized version of the Metropolis algorithm where we no longer require the condition that the function T(x,y) be symmetric. The algorithm can be outlined as:

1. Get a starting value x. This value can be chosen at random.
2. Find the y value from the function T(x,y). Note that T(x,y) no longer has to be symmetric.
3. Accept y with the probability $min(\frac{P(y)T(y, x)}{P(x)T(x, y)}, 1)$. Notice how the acceptance probability now contains the function T(x,y).
4. If the y is accepted it becomes the new x value. Otherwise the current x is kept.
5. After a large number of iterations the series will converge.
6. When the series has converged any new values can be treated as random samples from P(x).

To prove that Metropolis Hastings algorithm works we once again need to show that the Detailed Balance Condition holds.

Proof:
If T(x,y) = T(y,x) then this reduces to the Metropolis algorithm which we have already proven. Otherwise,

$\begin{matrix} A(x, y) & = & T(x,y) min(\frac{P(y)T(y, x)}{P(x)T(x, y)}, 1) \\ P(x)A(x, y) & = & P(x)T(x,y) min(\frac{P(y)T(y, x)}{P(x)T(x, y)}, 1) \\ & = & min(P(y)T(y, x), P(x)T(x,y)) \\ & = & P(y)T(y, x) min(1, \frac{P(x)T(x, y)}{P(y)T(y, x)}) \\ & = & P(y)A(y, x) \end{matrix}$

Which means that the Detailed Balance Condition holds and therefore P(x) is the equilibrium.

### Gibbs Sampling

Suppose we want to sample from the joint probability $P(x_1,x_2,x_3)$ but we cannot sample from it directly. We can however sample from the conditional distributions, such as $P(x_1 | x_2,x_3)$. The process can be defined as follows:

1. Start with a randomly chosen $x^{(0)}$ where $x^{(0)}=(x_1^{(0)}, x_2^{(0)}, x_3^{(0)})$.
2. Once we have an $x^{(t)}$ we can find an $x^{(t+1)}$ by sampling each component from its conditional probability distribution.
$\begin{matrix} x_1^{(t+1)} & \sim & P(x_1 | x_2^{(t)}, x_3^{(t)}) \\ x_2^{(t+1)} & \sim & P(x_2 | x_1^{(t+1)}, x_3^{(t)}) \\ x_3^{(t+1)} & \sim & P(x_3 | x_1^{(t+1)}, x_2^{(t+1)}) \end{matrix}$
3. We continue this process until the burn-in point, after which we are sampling from P(x).

This process may seem different from the previous methods but in fact Gibbs Sampling is only a special case of Metropolis Hastings. Suppose one would like to sample from P(x) where $x=(x_1, x_2, x_3 \dots x_d) \in R^d$. Write $x_{-q} = (x_1, \dots, x_{q-1}, x_{q+1}, \dots, x_d)$ for all components except the q-th. Propose a y with $y_{-q} = x_{-q}$ and $y_q$ drawn from $P(y_q | x_{-q})$. We can define the T(x,y) function from the Metropolis Hastings algorithm as $T(x,y) = P(y_q | y_{-q}) = P(y_q | x_{-q})$. In Gibbs Sampling we do not reject any of the values we sampled because our acceptance probability is:

$\begin{matrix} P(accept) & = & min(\frac{P(y)T(y, x)}{P(x)T(x, y)}, 1) \\ & = & min(\frac{P(y)P(x_q | x_{-q})}{P(x)P(y_q | x_{-q})}, 1) \\ & = & min(\frac{P(y_q | x_{-q})P(x_{-q})P(x_q | x_{-q})}{P(x_q | x_{-q})P(x_{-q})P(y_q | x_{-q})}, 1) \\ & = & min (1,1) = 1 \end{matrix}$

This quality makes Gibbs Sampling quite popular because we use everything we sample.

Example:
Say that we want to sample from:

$N \left[ \left( \begin{array}{c} \mu_1 \\ \mu_2 \end{array} \right), \left( \begin{array}{cc} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{array} \right) \right]$

We know that the conditional distribution of $x_1$ given $x_2$ is Gaussian with parameters:

$\begin{matrix} \mu_{1|2} & = & \mu_1+\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2) \\ \Sigma_{1|2} & = & \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \end{matrix}$

For this example suppose we want to sample from:

$N \left[ \left( \begin{array}{c} 0 \\ 0 \end{array} \right), \left( \begin{array}{cc} 1 & L \\ L & 1 \end{array} \right) \right]$

Then we can calculate:

$\begin{matrix} \mu_{1|2} & = & L x_2 \\ \Sigma_{1|2} & = & 1 - L^2 \end{matrix}$

By symmetry, the conditional distribution of $x_2$ given $x_1$ has mean $L x_1$ and the same variance. The sampling process is then done with:

$\begin{matrix} x_1^{(t+1)} & \sim & N(Lx_2^{(t)}, 1-L^2) \\ x_2^{(t+1)} & \sim & N(Lx_1^{(t+1)}, 1-L^2) \end{matrix}$
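The two-variable Gaussian example can be sketched as below, alternately drawing each coordinate from a normal with mean L times the other coordinate and variance $1 - L^2$. The function name `gibbs_bivariate_normal`, the value L = 0.8, the starting point and the burn-in length are illustrative assumptions.

```python
import math
import random

def gibbs_bivariate_normal(L, n_samples, burn_in, rng):
    """Gibbs sampler for a zero-mean bivariate normal with unit variances and correlation L."""
    sd = math.sqrt(1.0 - L * L)  # conditional standard deviation
    x1, x2 = 0.0, 0.0            # arbitrary starting point (assumption)
    samples = []
    for t in range(burn_in + n_samples):
        x1 = rng.gauss(L * x2, sd)  # x1 | x2 ~ N(L * x2, 1 - L^2)
        x2 = rng.gauss(L * x1, sd)  # x2 | x1 ~ N(L * x1, 1 - L^2), using the new x1
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

# L = 0.8 and the burn-in of 1000 sweeps are illustrative assumptions.
pairs = gibbs_bivariate_normal(L=0.8, n_samples=100_000, burn_in=1_000, rng=random.Random(6))
# With zero means and unit variances, E[x1 * x2] equals the correlation L.
mean_product = sum(a * b for a, b in pairs) / len(pairs)
print(mean_product)  # should be close to 0.8
```

Every draw is kept, but consecutive pairs are correlated, and the correlation between sweeps grows as |L| approaches 1.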

### Independence Chains

In the Metropolis Hastings algorithm we used a T(x,y) to get the next values in the sample. Suppose now that T(x,y) = T(y). In other words, the function T does not depend on x. The acceptance probability would now become $min(1, \frac{P(y)T(x)}{P(x)T(y)})$.

#### Bayesian Inference

In Bayesian Inference we would like to find P(θ | Data). Suppose we use the prior on θ as the transition function and then we apply Metropolis Hastings. Our acceptance probability would become:

$min \left( 1, \frac{P(\theta^{(t+1)}|Data)P(\theta^{(t)})} { P(\theta^{(t)}|Data)P(\theta^{(t+1)})} \right)$

Now, recall that using Bayes rule we can write $P(\theta|Data) =\frac{ P(Data|\theta)P(\theta) } {P(Data)}$. We also know that P(Data | θ) = Likelihood. From that we can rewrite the above Bayes formula as $P(\theta|Data) =\frac{ L(Data;\theta)P(\theta) } {P(Data)}$.

Therefore, to sample from the posterior in a Bayesian Inference we can simply propose a $\theta^{(t+1)}$ from the prior and then we accept with probability:

$\begin{matrix} AcceptanceProb & = & min \left( 1, \frac{P(\theta^{(t+1)}|Data) P(\theta^{(t)})} {P(\theta^{(t)}|Data)P(\theta^{(t+1)})} \right) \\ & = & min \left( 1, \frac{L(Data; \theta^{(t+1)})P(\theta^{(t)})P(\theta^{(t+1)})} { L(Data; \theta^{(t)})P(\theta^{(t+1)})P(\theta^{(t)})} \right) \\ & = & min \left( 1, \frac{L(Data; \theta^{(t+1)})} { L(Data; \theta^{(t)})} \right) \end{matrix}$

Example:
We would like to sample from:

N(7,0.25) with probability α

and from:

N(10,0.25) with probability (1 − α)

The problem is that we are missing the parameter α. We do however know that P(α) = UNIF(0,1). To sample from the posterior of α we start with a randomly chosen $\alpha^{(t)}$, propose $\alpha^{(t+1)}$ from the prior and accept it with probability $min \left( 1, \frac{L(Data; \alpha^{(t+1)})} { L(Data; \alpha^{(t)})} \right)$. When we reject we simply use the previous value again. This method also requires a burn in time so we must wait before we can begin sampling.
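A rough sketch of this example follows. It makes several assumptions that the notes do not state: the 0.25 in N(7, 0.25) is read as a variance, the data are simulated with a true α of 0.3, and the likelihood ratio is evaluated on the log scale for numerical stability. All function and variable names are hypothetical.

```python
import math
import random

SD = 0.5  # assumption: 0.25 in N(7, 0.25) and N(10, 0.25) is a variance

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def log_likelihood(alpha, data):
    """Log of L(Data; alpha) for the two-component mixture."""
    return sum(
        math.log(alpha * normal_pdf(x, 7.0, SD) + (1.0 - alpha) * normal_pdf(x, 10.0, SD))
        for x in data
    )

def posterior_alpha(data, n_samples, burn_in, rng):
    """Independence chain: propose alpha from the UNIF(0, 1) prior."""
    alpha = rng.random()
    ll = log_likelihood(alpha, data)
    chain = []
    for t in range(burn_in + n_samples):
        proposal = rng.random()
        ll_new = log_likelihood(proposal, data)
        # Accept with probability min(1, L(new) / L(old)).
        if rng.random() < math.exp(min(0.0, ll_new - ll)):
            alpha, ll = proposal, ll_new
        if t >= burn_in:
            chain.append(alpha)  # on rejection the previous value is used again
    return chain

rng = random.Random(7)
true_alpha = 0.3  # used only to simulate illustrative data
data = [rng.gauss(7.0, SD) if rng.random() < true_alpha else rng.gauss(10.0, SD)
        for _ in range(100)]
alpha_chain = posterior_alpha(data, n_samples=3_000, burn_in=500, rng=rng)
posterior_mean = sum(alpha_chain) / len(alpha_chain)
print(posterior_mean)
```

Proposing from the prior keeps the acceptance probability simple, but most proposals are rejected once the posterior is much narrower than the prior.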

### Simulated Annealing

Consider the general optimization problem $\min_x h(x)$ and the distribution P(x). Instead of finding the minimum of h(x) we can try to find the maximum of $P(x)\propto exp\left\lbrace \frac{-h(x)}{T} \right\rbrace$. In this case T is called the temperature and it determines the shape of the distribution. As T increases the distribution flattens out, but as $T\rightarrow0$ the $x_i$ that we sample from P(x) are very close to the global minimum.

Note: If x is the minimum of h(x) then x is also the maximum of P(x).

We can define the steps to the problem as:

1. Start with a randomly chosen x and set T to a large value.
2. Propose a $y \neq x$ from the function T(x,y) = T(y,x).
3. Accept the y value with probability $min(1, \frac{P(y)}{P(x)})$.

But what exactly does $\frac{P(y)}{P(x)}$ mean? We can estimate each of these probabilities with the $exp\left\lbrace \frac{-h(x)}{T} \right\rbrace$ expression we introduced earlier.

$\begin{matrix} \frac{P(y)}{P(x)} & = & \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}} \\ & = & e^{\frac{h(x) - h(y)}{T}} \end{matrix}$

We are now left with two possible cases. If h(y) < h(x) then P(y) > P(x) which is desired and so we will always accept the new y. Otherwise, if h(y) > h(x) we may not accept the new y value and we can see that as $T \rightarrow 0$ then $e^{\frac{h(x) - h(y)}{T}}$ will also go to zero and so the acceptance probability will go to zero.

For this method we can write down a rough algorithm:
Start with $x_0$ and consider a set $T_1 > T_2 > \dots > T_K$ of K temperature values.

    for k = 1 to K
        for j = 1 to N_k
            Propose a y from T(x_{j-1}, y).
            U = UNIF(0,1)
            if U <= min(1, P(y) / P(x_{j-1}))    (P uses the temperature T_k)
                x_j = y
            else
                x_j = x_{j-1}
            endif
        endfor
    endfor
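The rough algorithm can be turned into a short runnable sketch. The objective $h(x) = x^2 + 10\sin(x)$, the geometric cooling schedule, the uniform random-walk proposal and the function name `simulated_annealing` are all illustrative assumptions.

```python
import math
import random

def simulated_annealing(h, x0, temperatures, steps_per_temp, step_size, rng):
    """Minimise h by Metropolis moves under a decreasing temperature schedule."""
    x = x0
    for T in temperatures:
        for _ in range(steps_per_temp):
            y = x + rng.uniform(-step_size, step_size)  # symmetric proposal
            delta = h(x) - h(y)
            # Acceptance probability min(1, exp((h(x) - h(y)) / T)):
            # improvements (delta >= 0) are always accepted.
            if delta >= 0 or rng.random() < math.exp(delta / T):
                x = y
    return x

def h(x):
    # Assumed test objective with one global and one local minimum.
    return x * x + 10.0 * math.sin(x)

x_start = 5.0
schedule = [10.0 * 0.8 ** k for k in range(40)]  # T_1 > T_2 > ... > T_K (assumption)
x_final = simulated_annealing(h, x_start, schedule, steps_per_temp=200,
                              step_size=1.0, rng=random.Random(8))
print(x_final, h(x_final))
```

A slower cooling schedule gives the chain more chances to leave a local minimum before the temperature becomes too low to climb out.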

### Bootstrap

In data analysis we usually have an observed set of data $\left\lbrace x_1, x_2, \dots, x_n \right\rbrace$ from a probability distribution P and we have an estimator $\hat{\theta}$ for our parameter of interest θ. In general it would be useful to know the distribution of our $\hat{\theta}$. For instance, if the estimator has a larger variance then we know that it is not very accurate. The problem is that it is not always easy to determine the distribution of an estimator. Ideally we would like to be able to sample directly from P and then for each sample of size n we can calculate a $\hat{\theta}$. In this way a number of estimates for θ can be found and their distribution can be determined from the samples.

For Example:

$\begin{matrix} \lbrace x_1^{(1)}, x_2^{(1)}, \dots, x_n^{(1)} \rbrace & \Rightarrow & \hat{\theta_1} \\ \lbrace x_1^{(2)}, x_2^{(2)}, \dots, x_n^{(2)} \rbrace & \Rightarrow & \hat{\theta_2} \\ \dots & & \\ \lbrace x_1^{(B)}, x_2^{(B)}, \dots, x_n^{(B)} \rbrace & \Rightarrow & \hat{\theta_B} \end{matrix}$

Based on $\lbrace \hat{\theta_1}, \hat{\theta_2}, \dots, \hat{\theta_B} \rbrace$ we can try to determine the distribution of $\hat{\theta}$.

However, this idea is unrealistic because we don't know P and so we cannot sample from it. This is where the Bootstrap idea comes in. Assume that we have a set of data $\left\lbrace x_1, x_2, \dots, x_n \right\rbrace$ from an unknown distribution P. To simulate sampling from P we can resample with replacement from the set of n data points. Every sample we get in this way we can use to estimate a different $\hat{\theta}$. We can use this method to find a collection of $\hat{\theta_i}$ parameters from which we can:

1. Find the expectation of $\hat{\theta}$.
$E(\hat{\theta}) = \frac{1}{B} \sum_{r=1}^B \hat{\theta_r}$
2. Find the variance of $\hat{\theta}$.
$Var(\hat{\theta}) = \frac{1}{B-1}\sum_{r=1}^B(\hat{\theta_r} - E(\hat{\theta}))^2$
3. Find a confidence interval, where S.E. is the square root of the variance above.
$(\hat{\theta} - 2*S.E., \hat{\theta} + 2*S.E.)$
4. Find the bias.
$bias(\hat{\theta}) = E(\hat{\theta}) - \hat{\theta}_{original}$
5. Bias correction.
$\hat{\theta}_{original} - bias$
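The quantities in this list can be computed with a short resampling loop. In this sketch the estimator is the sample mean, the data are simulated from a normal distribution purely for illustration, and the bias is taken as the bootstrap mean minus the original estimate (an assumption about the sign convention). The function names are hypothetical.

```python
import random
import statistics

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap(data, estimator, B, rng):
    """Return B estimates, each computed on a resample drawn with replacement."""
    n = len(data)
    return [estimator([rng.choice(data) for _ in range(n)]) for _ in range(B)]

rng = random.Random(9)
# Illustrative data only; in practice the distribution P is unknown.
data = [rng.gauss(5.0, 2.0) for _ in range(200)]

theta_hat = mean(data)                    # estimate on the original data
reps = bootstrap(data, mean, B=1000, rng=rng)

boot_mean = mean(reps)                    # E(theta_hat)
std_err = statistics.stdev(reps)          # square root of Var(theta_hat), 1/(B-1) form
interval = (theta_hat - 2 * std_err, theta_hat + 2 * std_err)
bias = boot_mean - theta_hat              # assumed sign convention
corrected = theta_hat - bias

print(std_err, interval, bias, corrected)
```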

At first, this method seems strange. We are sampling from the sample itself and not the distribution. However, it has been shown that the Bootstrap method does indeed work and can provide more useful information on top of what the raw data could have provided.

This kind of Bootstrap is called the Naive Bootstrap because the values are sampled one at a time independently, which destroys any correlation present in the original data. When the data are correlated, a better Bootstrap method selects blocks of data in order to keep the correlation. These blocks are sampled with replacement and may overlap.