Stat341 / CM 361

From Wiki Course Notes

(Redirected from Stat341)
Jump to: navigation, search

Computational Statistics and Data Analysis is a course offered at the University of Waterloo
Spring 2009
Instructor: Ali Ghodsi


Sampling (Generating random numbers)

Generating Random Numbers - May 12, 2009

Generating random numbers in a computational setting presents challenges. A good way to generate random numbers in computational statistics involves analyzing various distributions using computational methods. As a result, the probability distribution of each possible number appears to be uniform (pseudo-random). Outside a computational setting, presenting a uniform distribution is fairly easy (for example, rolling a fair die repetitively to produce a series of random numbers from 1 to 6).

We begin by considering the simplest case: the uniform distribution.

Multiplicative Congruential Method

One way to generate pseudo random numbers from the uniform distribution is using the Multiplicative Congruential Method. This involves three integer parameters a, b, and m, and a seed variable x0. This method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds as follows:

x_{i+1} = (ax_{i} + b) \mod{m}

For example, with a = 13, b = 0, m = 31, x0 = 1, we have:

x_{i+1} = 13x_{i} \mod{31}


\begin{align} x_{0} &{}= 1 \end{align}
x_{1} &{}= 13 \times 1  + 0 \mod{31} = 13 \\
x_{2} &{}= 13 \times 13 + 0 \mod{31} = 14 \\
x_{3} &{}= 13 \times 14 + 0 \mod{31} =27 \\


The above generator of pseudorandom numbers is called a Mixed Congruential Generator or Linear Congruential Generator, as they involve both an additive and a muliplicative term. For correctly chosen values of a, b, and m, this method will generate a sequence of integers including all integers between 0 and m - 1. Scaling the output by dividing the terms of the resulting sequence by m - 1, we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.

Of course, not all values of a, b, and m will behave in this way, and will not be suitable for use in generating pseudo random numbers.

For example, with a = 3, b = 2, m = 4, x0 = 1, we have:

x_{i+1} = (3x_{i} + 2) \mod{4}


\begin{align} x_{0} &{}= 1 \end{align}
x_{1} &{}= 3 \times 1 + 2 \mod{4} = 1 \\
x_{2} &{}= 3 \times 1 + 2 \mod{4} = 1 \\


For an ideal situation, we want m to be a very large prime number, x_{n}\not= 0 for any n, and the period is equal to m-1. In practice, it has been found by a paper published in 1988 by Park and Miller, that a = 75, b = 0, and m = 231 - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.

Java's Random class is based on a generator with a = 25214903917, b = 11, and m = 248[1]. The class returns at most 32 leading bits from each xi, so it is possible to get the same value twice in a row, (when x0 = 18698324575379, for instance) without repeating it forever.

General Methods

Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudo random numbers from other distributions.

Inverse Transform Method

This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative density function (cdf) of some distribution, the result is a random sample of that distribution. This is shown by this theorem:


If U \sim~ \mathrm{Unif}[0, 1] is a random variable and X = F − 1(U), where F is continuous, monotonic, and is the cumulative density function (cdf) for some distribution, then the distribution of the random variable X is given by F(X).


Recall that, if f is the pdf corresponding to F where f is defined as 0 outside of its domain, then,

F(x) = P(X \leq x) = \int_{-\infty}^x f(x)
\int_1^{\infty} \frac{x^k}{x^2} dx

So F is monotonically increasing, since the probability that X is less than a greater number must be greater than the probability that X is less than a lesser number.

Note also that in the uniform distribution on [0, 1], we have for all a within [0, 1], P(U \leq a) = a.


P(F^{-1}(U) \leq x) &{}= P(F(F^{-1}(U)) \leq F(x)) \\
                    &{}= P(U \leq F(x)) \\
                    &{}= F(x)

Completing the proof.

Procedure (Continuous Case)

This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:

  • Step 1: Draw U \sim~ Unif [0, 1] .
  • Step 2: Compute X = F − 1(U).


Suppose we want to draw a sample from f(x) = λe − λx where x > 0 (the exponential distribution).

We need to first find F(x) and then its inverse, F − 1.

 F(x) = \int^x_0 \theta e^{-\theta u} du = 1 - e^{-\theta x}
 F^{-1}(x) = \frac{-\log(1-y)}{\theta} = \frac{-\log(u)}{\theta}

Now we can generate our random sample u_1\dots u_n from F(x) by:

1)\ u_i \sim Unif[0, 1]
2)\ x_i = \frac{-\log(u_i)}{\theta}

The xi are now a random sample from f(x).

This example can be illustrated in Matlab using the code below. Generate ui, calculate xi using the above formula and letting θ = 1, plot the histogram of xi's for i = 1,...,100,000.

"Histogram showing the expected exponentional distribution"

The major problem with this approach is that we have to find F − 1 and for many distributions it is too difficult (or impossible) to find the inverse of F(x). Further, for some distributions it is not even possible to find F(x) (i.e. a closed form expression for the distribution function, or otherwise; even if the closed form expression exists, it's usually difficult to find F − 1).

Procedure (Discrete Case)

The above method can be easily adapted to work on discrete distributions as well.

In general in the discrete case, we have x_0, \dots , x_n where:

\begin{align}P(X = x_i) &{}= p_i \end{align}
x_0 \leq x_1 \leq x_2 \dots \leq x_n
\sum p_i = 1

Thus we can define the following method to find pseudo random numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of U = 1 is missed):

  • Step 1: Draw  U~ \sim~ Unif [0,1] .
  • Step 2:
    • If U < p0, return X = x0
    • If p_0 \leq U < p_0 + p_1, return X = x1
    • ...
    • In general, if p_0+ p_1 + \dots + p_{k-1} \leq U < p_0 + \dots + p_k, return X = xk

Example (from class):

Suppose we have the following discrete distribution:

P(X = 0) &{}= 0.3 \\
P(X = 1) &{}= 0.2 \\
P(X = 2) &{}= 0.5

The cumulative density function (cdf) for this distribution is then:

F(x) = \begin{cases}
0, & \text{if } x < 0 \\
0.3, & \text{if } 0 \leq x < 1 \\
0.5, & \text{if } 1 \leq x < 2 \\
1, & \text{if } 2 \leq x

Then we can generate numbers from this distribution like this, given u_0, \dots, u_n from U \sim~ Unif[0, 1]:

x_i = \begin{cases}
0, & \text{if } u_i \leq 0.3 \\
1, & \text{if } 0.3 < u_i \leq 0.5 \\
2, & \text{if } 0.5 < u_i \leq 1

This example can be illustrated in Matlab using the code below:

for i=1:1000;
  if u <= p(1)
  elseif u < sum(p(1,2))

Acceptance-Rejection Sampling - May 14, 2009

Today, we continue the discussion on sampling (generating random numbers) from general distributions with the Acceptance/Rejection Method.

Acceptance/Rejection Method

Ali: Some statements are incorrect, inaccurate or misleading. Acceptance-Rejection Method needs to be motivated in more details.

Suppose we wish to sample from a target distribution f(x) that is difficult to sample from directly.

Let g(x) be a distribution that is easy to sample from and satisfies the condition:

\forall x: f(x) \leq c \cdot g(x)\ , where  c \in \Re^+

"Graph of the pdf of f(x) (target distribution) and cg(x) (proposal distribution)"

Since c*g(x) > f(x) for all x, it is possible to obtain samples that follows f(x) by rejecting a proportion of samples drawn from c*g(x).

This proportion depends on how different f(x) and g(x) are and may vary at different values of x.

That is, if  f(x) \approx g(x) \text { at } x = x_1  \text { and } f(x) \ll g(x) \text { at } x = x_2 , we will need to reject more samples drawn at  \,x_2 than at  \,x_1 .

Overall, it can shown that by accepting samples drawn from g(x) with probability  \frac {f(x)}{c \cdot g(x)} , we can obtain samples that follows f(x)

Consider the example in the graph,
Sampling y = 7 from cg(x) will yield a sample that follows the target distribution f(x) and will y be accepted w/p 1.

Sampling y = 9 from cg(x) will yield a point that is distant from f(x) and will be accepted with a low probability.


Ali: Proof of what? .

Show that if points are sampled according to the Acceptance/Rejection method then they follow the target distribution.

 P(X=x|accept) = \frac{P(accept|X=x)P(X=x)}{P(accept)}
by Bayes' theorem

\begin{align} &P(accept|X=x) = \frac{f(x)}{c \cdot g(x)}\\ &Pr(X=x) = g(x)\frac{}{} \end{align}

by hypothesis.

\begin{align} P(accept) &= \int^{}_x P(accept|X=x)P(X=x) dx \\
           &= \int^{}_x \frac{f(x)}{c \cdot g(x)} g(x) dx\\
           &= \frac{1}{c} \int^{}_x f(x) dx\\
           &= \frac{1}{c} \end{align}

 P(X=x|accept) = \frac{\frac{f(x)}{c\ \cdot g(x)}g(x)}{\frac{1}{c}} = f(x) as required.

Procedure (Continuous Case)

  • Choose g(x) (a density function that is simple to sample from)
  • Find a constant c such that : c \cdot g(x) \geq f(x)
  1. Let Y \sim~ g(y)
  2. Let U \sim~ Unif [0,1]
  3. If U \leq \frac{f(x)}{c \cdot g(x)} then X=Y; else reject and go to step 1


Suppose we want to sample from Beta(2,1), for  0 \leq x \leq 1 . Recall:

 Beta(2,1) = \frac{\Gamma (2+1)}{\Gamma (2) \Gamma(1)} \times x^1(1-x)^0 = \frac{2!}{1!0!} \times x = 2x
  • Choose  g(x) \sim~ Unif [0,1]
  • Find c: c = 2 (see notes below)
  1. Let  Y \sim~ Unif [0,1]
  2. Let  U \sim~ Unif [0,1]
  3. If U \leq \frac{2Y}{2} = Y , then X=Y; else go to step 1

c was chosen to be 2 in this example since  \max \left(\frac{f(x)}{g(x)}\right) in this example is 2. This condition is important since it helps us in finding a suitable c to apply the Acceptance/Rejection Method.

In MATLAB, the code that demonstrates the result of this example is:

   j = 1;
       while i < 1000
           y = rand;
           u = rand;
           if u <= y
               x(j) = y;
               j = j + 1;
               i = i + 1;

The histogram produced here should follow the target distribution, f(x) = 2x, for the interval  0 \leq x \leq 1 .

The histogram shows the PDF of a Beta(2,1) distribution as expected.


The Discrete Case

The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution g(x) must also be discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.


Suppose we want to sample from a distribution with the following probability mass function (pmf):

P(X=1) = 0.15
P(X=2) = 0.55
P(X=3) = 0.20
P(X=4) = 0.10 
  • Choose g(x) to be the discrete uniform distribution on the set {1,2,3,4}
  • Find c:  c = \max \left(\frac{f(x)}{g(x)} \right)= 0.55/0.25 = 2.2
  1. Generate  Y \sim~ Unif \{1,2,3,4\}
  2. Let  U \sim~ Unif [0,1]
  3. If U \leq \frac{f(x)}{2.2 \times 0.25} , then X=Y; else go to step 1


If the proposed distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size of the target distribution can be established. It may also be difficult to find such g(x) that satisfies all the conditions of the procedure.

We will now begin to discuss sampling from specific distributions.

Special Technique for sampling from Gamma Distribution

Recall that the cdf of the Gamma distribution is:

 F(x) = \int_0^{\lambda x} \frac{e^{-y}y^{t-1}}{(t-1)!} dy

If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.

Additive Property of Gamma Distribution

Recall that if X_1, \dots, X_t are independent exponential distributions with mean λ (in other words,  X_i\sim~ Exp (\lambda) ), then  \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda)

It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with mean λ (using the Inverse Method) and add them up. Details will be discussed in the next lecture.

Techniques for Normal and Gamma Sampling - May 19, 2009

We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases,
Gamma distributions and Normal distributions, where such practical methods exist.

Gamma Distribution

Given the additive property of the gamma distribution,

If X_1, \dots, X_t are independent random variables with  X_i\sim~ Exp (\lambda) then,

 \Sigma_{i=1}^t X_i \sim~ Gamma (t, \lambda)

Using this property, we can use the Inverse Transform Method to generate samples from an exponential distribution with appropriate variables, and use these to generate a sample following a Gamma distribution.

  1. Sample independently from a uniform distribution t times, giving  u_1,\dots,u_t
  2. Sample independently from an exponential distribution t times, giving  x_1,\dots,x_t such that,
     \begin{align} x_1 \sim~ Exp(\lambda)\\ \vdots \\ x_t \sim~ Exp(\lambda) \end{align}

    Using the Inverse Transform Method,
     \begin{align} x_i = -\frac {1}{\lambda}\log(u_i)  \end{align}
  3. Using the additive property,
     \begin{align}  X &{}= x_1 + x_2 + \dots + x_t \\ X &{}= -\frac {1}{\lambda}\log(u_1) - \frac {1}{\lambda}\log(u_2)  \dots - \frac {1}{\lambda}\log(u_t) \\ X &{}= -\frac {1}{\lambda}\log(\prod_{i=1}^{t}u_i) \sim~ Gamma (t, \lambda)  \end{align}

This procedure can be illustrated in Matlab using the code below with t = 5,λ = 1 :

U = rand(10000,5);
X = sum( -log(U), 2);


Normal Distribution

"Diagram of the Box Muller transform, which transforms uniformly distributed value pairs to normally distributed value pairs." [Box-Muller Transform, Wikipedia]

The cdf for the Standard Normal distribution is:

 F(Z) = \int_{-\infty}^{Z}\frac{1}{\sqrt{2\pi}}e^{-x^2/2}dx

We can see that the normal distribution is difficult to sample from using the general methods seen so far, eg. the inverse is not easy to obtain from F(Z); we may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal Distribution.

Box-Muller Method

[Add a picture WikiSysop 19:25, 1 June 2009 (UTC)]

The Box-Muller or Polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.

Properties of Polar and Cartesian Coordinates
If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis then,
  •  \begin{align} r^2 = x^2 + y^2 \end{align}
  •  \tan(\theta) = \frac{y}{x}

  •  \begin{align} x = r \cos(\theta) \end{align}
  •  \begin{align} y = r \sin(\theta) \end{align}

Let X and Y be independent random variables with a standard normal distribution,

 X \sim~ N(0,1)
 Y \sim~ N(0,1)

also, let r and θ be the polar coordinates of (x,y). Then the joint distribution of independent x and y is given by,

\begin{align} f(x,y) = f(x)f(y) &{}= \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}\frac{1}{\sqrt{2\pi}}e^{-\frac{y^2}{2}} \\ 
   &{}=\frac{1}{2\pi}e^{-\frac{x^2+y^2}{2}} \end{align}

It can also be shown that the joint distribution of r and θ is given by,

\begin{matrix}  f(r,\theta) = \frac{1}{2}e^{-\frac{d}{2}}\times\frac{1}{2\pi},\quad  d = r^2 \end{matrix}

Note that  \begin{matrix}f(r,\theta)\end{matrix} consists of two density functions, Exponential and Uniform, so assuming that r and θ are independent  \begin{matrix} \Rightarrow d \sim~ Exp(1/2),  \theta \sim~ Unif[0,2\pi] \end{matrix}

Procedure for using Box-Muller Method
  1. Sample independently from a uniform distribution twice, giving  \begin{align} u_1,u_2 \sim~ \mathrm{Unif}(0, 1) \end{align}
  2. Generate polar coordinates using the exponential distribution of d and uniform distribution of θ,
  d = -2\log(u_1),& \quad r = \sqrt{d} \\ & \quad \theta = 2\pi u_2  \end{align}
  3. Transform r and θ back to x and y,
     \begin{align} x = r\cos(\theta) \\ y = r\sin(\theta) \end{align}

Notice that the Box-Muller Method generates a pair of independent Standard Normal distributions, x and y.

This procedure can be illustrated in Matlab using the code below:

u1 = rand(5000,1);
u2 = rand(5000,1);

d = -2*log(u1);
theta = 2*pi*u2;

x = d.^(1/2).*cos(theta);
y = d.^(1/2).*sin(theta);




Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:

title('d follows an exponential distribution');
title('theta follows a uniform distribution over [0, 2*pi]');


Useful Properties (Single and Multivariate Normal)

The Box-Muller method as described above samples only from the standard normal distribution. However, both singlevariate and multivariate normal distributions have properties that allow us to use samples generated by the Box-Muller method to sample any normal distribution in general.

Properties of Normal distributions
  •  \begin{align} \text{If } & X = \mu + \sigma Z, &  Z \sim~ N(0,1) \\ &\text{then }  X \sim~ N(\mu,\sigma ^2) \end{align}
  •  \begin{align} \text{If } & \vec{Z} = (Z_1,\dots,Z_d)^T, &  Z_1,\dots,Z_d \sim~ N(0,1) \\ &\text{then }  \vec{Z} \sim~ N(\vec{0},I) \end{align}
  •  \begin{align} \text{If } & \vec{X} = \vec{\mu} + \Sigma^{1/2} \vec{Z}, &  \vec{Z} \sim~ N(\vec{0},I) \\ &\text{then }  \vec{X} \sim~ N(\vec{\mu},\Sigma) \end{align}

These properties can be illustrated through the following example in Matlab using the code below:

Example: For a Multivariate Normal distribution u=\begin{bmatrix} -2 \\ 3 \end{bmatrix} and \Sigma=\begin{bmatrix} 1&0.5\\ 0.5&1\end{bmatrix}

u = [-2; 3];
sigma = [ 1 1/2; 1/2 1];

r = randn(15000,2);
ss = chol(sigma);

X = ones(15000,1)*u' + r*ss;
plot(X(:,1),X(:,2), '.');


Note: In the example above, we generated the square root of Σ using the Cholesky decomposition,

ss = chol(sigma);

which gives ss=\begin{bmatrix} 1&0.5\\ 0&0.8660\end{bmatrix}. Matlab also has the sqrtm function:

ss = sqrtm(sigma);

which gives a different matrix, ss=\begin{bmatrix} 0.9659&0.2588\\ 0.2588&0.9659\end{bmatrix}, but the plot looks about the same (X has the same distribution).

Bayesian and Frequentist Schools of Thought - May 21, 2009

In this lecture we will continue to discuss sampling from specific distributions , introduce Monte Carlo Integration, and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to Bayesian Inference.

Binomial Distribution

A Binomial distribution X \sim~ Bin(n,p) is the sum of n independent Bernoulli trials, each with probability of success p (0 \leq p \leq 1). For each trial we generate an independent uniform random variable: U_1, \ldots, U_n \sim~ Unif(0,1). Then X is the number of times that U_i \leq p. In this case if n is large enough, by the central limit theorem, the Normal distribution can be used to approximate a Binomial distribution.

Sampling from Binomial distribution in Matlab is done using the following code:


Where the histogram is a Binomial distribution, and for higher n, it would resemble a Normal distribution.

Monte Carlo Integration

Monte Carlo Integration is a numerical method of approximating the evaluation of integrals by simulation. In this course we will mainly look at three methods for approximating integrals:

  1. Basic Monte Carlo Integration
  2. Importance Sampling
  3. Markov Chain Monte Carlo (MCMC)

Bayesian VS Frequentists

During the history of statistics, two major schools of thought emerged along the way and have been locked in an on-going struggle in trying to determine which one has the correct view on probability. These two schools are known as the Bayesian and Frequentist schools of thought. Both the Bayesians and the Frequentists holds a different philosophical view on what defines probability. Below are some fundamental differences between the Bayesian and Frequentist schools of thought:


  • Probability is objective and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time.
  • Parameters are all fixed and unknown constants.
  • Any statistical process only has interpretations based on limited frequencies. For example, a 95% C.I. of a given parameter will contain the true value of the parameter 95% of the time.


  • Probability is subjective and can be applied to single events based on degree of confidence or beliefs. For example, Bayesian can refer to tomorrow's weather as having 50% of rain, whereas this would not make sense to a Frequentist because tomorrow is just one unique event, and cannot be referred to as a relative frequency in a large number of trials.
  • Parameters are random variables that has a given distribution, and other probability statements can be made about them.
  • Probability has a distribution over the parameters, and point estimates are usually done by either taking the mode or the mean of the distribution.

Bayesian Inference

Example: If we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels, then all together the 20 pixels that make up the screen can be referred to as \vec{X}, which is our data, and the parameter of the data for this case, which we will refer to as θ, would be a discrete random variable that can take on the values of 0 to 9. In this example, a Bayesian would be interested in finding  Pr(\theta=a|\vec{X}=\vec{x}), whereas a Frequentist would be more interested in finding  Pr(\vec{X}=\vec{x}|\theta=a)

Bayes' Rule
f(\theta|X) = \frac{f(X | \theta)\, f(\theta)}{f(X)}.

Note: In this case f(θ | X) is referred to as posterior, f(X | θ) as likelihood, f(θ) as prior, and f(X) as the marginal, where θ is the parameter and X is the observed variable.

Procedure in Bayesian Inference

  • First choose a probability distribution as the prior, which represents our beliefs about the parameters.
  • Then choose a probability distribution for the likelihood, which represents our beliefs about the data.
  • Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.

As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.


  • Mode: \theta = \arg\max_{\theta} f(\theta|X) \gets value of θ that maximizes f(θ | X)
  • Mean:  \bar\theta = \int^{}_\theta \theta \cdot f(\theta|X)d\theta

If it is the case that θ is high-dimensional, and we are only interested in one of the components of θ, for example, we want θ1 from  \vec{\theta}=(\theta_1,\dots,\theta_n), then we would have to calculate the integral: \int^{} \int^{} \dots \int^{}f(\theta|X)d\theta_2d\theta_3 \dots d\theta_n

This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.


  1. f(x)=\int^{}_\theta f(X | \theta)f(\theta) d\theta is not a function of θ, and is called the Normalization Factor
  2. Therefore, since f(x) is like a constant, the posterior is proportional to the likelihood times the prior: f(\theta|X)\propto f(X | \theta)f(\theta)

Monte Carlo Integration - May 26, 2009

Today's lecture completes the discussion on the Frequentists and Bayesian schools of thought and introduces Basic Monte Carlo Integration.

Frequentist vs Bayesian Example - Estimating Parameters

Estimating parameters of a univariate Gaussian:

Frequentist: f(x|\theta)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}*(\frac{x-\mu}{\sigma})^2}
Bayesian: f(\theta|x)=\frac{f(x|\theta)f(\theta)}{f(x)}

Frequentist Approach

Let XN denote (x1,x2,...,xn). Using the Maximum Likelihood Estimation approach for estimating parameters we get:

L(X^N; \theta) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i- \mu} {\sigma})^2}
l(X^N; \theta) = \sum_{i=1}^N -\frac{1}{2}log (2\pi) - log(\sigma) - \frac{1}{2} \left(\frac{x_i- \mu}{\sigma}\right)^2
\frac{dl}{d\mu} = \displaystyle\sum_{i=1}^N(x_i-\mu)

Setting \frac{dl}{d\mu} = 0 we get

\displaystyle\sum_{i=1}^Nx_i = \displaystyle\sum_{i=1}^N\mu
\displaystyle\sum_{i=1}^Nx_i = N\mu \rightarrow \mu = \frac{1}{N}\displaystyle\sum_{i=1}^Nx_i
Bayesian Approach

Assuming the prior is Gaussian:

P(\theta) = \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{x-\mu_0}{\tau})^2}
f(\theta|x) \propto \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2}(\frac{x_i-\mu}{\sigma})^2} * \frac{1}{\sqrt{2\pi}\tau}e^{-\frac{1}{2}(\frac{x-\mu_0}{\tau})^2}

By completing the square we conclude that the posterior is Gaussian as well:



\tilde{\mu} = \frac{\frac{N}{\sigma^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\bar{x} + \frac{\frac{1}{\tau^2}}{{\frac{N}{\sigma^2}}+\frac{1}{\tau^2}}\mu_0

The expectation from the posterior is different from the MLE method. Note that \displaystyle\lim_{N\to\infty}\tilde{\mu} = \bar{x}. Also note that when N = 0 we get \tilde{\mu} = \mu_0.

Basic Monte Carlo Integration

Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term Monte Carlo is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. Stochastic simulation refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as Monte Carlo methods. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [1]

We start with a simple example:

\begin{align}I &= \displaystyle\int_a^b h(x)\,dx
    &= \displaystyle\int_a^b w(x)f(x)\,dx \end{align}


\displaystyle w(x) = h(x)(b-a)
f(x) = \frac{1}{b-a} \rightarrow the p.d.f. is Unif(a,b)
\hat{I} = E_f[w(x)] = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)

If x_i \sim~ Unif(a,b) then by the Law of Large Numbers \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i) \rightarrow \displaystyle\int w(x)f(x)\,dx = E_f[w(x)]


Given I = \displaystyle\int^b_ah(x)\,dx

  1. \begin{matrix} w(x) = h(x)(b-a)\end{matrix}
  2.  \begin{matrix} x_1, x_2, ..., x_n \sim UNIF(a,b)\end{matrix}
  3. \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)

From this we can compute other statistics, such as

  1.  SE=\frac{s}{\sqrt{N}} where s^2=\frac{\sum_{i=1}^{N}(Y_i-\hat{I})^2 }{N-1} with  \begin{matrix}Y_i=w(x_i)\end{matrix}
  2. \begin{matrix} 1-\alpha \end{matrix} CI's can be estimated as  \hat{I}\pm Z_\frac{\alpha}{2}\times SE

Example 1

Find  E[\sqrt{x}] for \begin{matrix} f(x) = e^{-x}\end{matrix}

  1. We need to draw from \begin{matrix} f(x) = e^{-x}\end{matrix}
  2. \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nw(x_i)

This example can be illustrated in Matlab using the code below:

h= x.^ .5
%The value obtained using the Monte Carlo method
F = @ (x) sqrt (x). * exp(-x)
%The value of the real function using Matlab

Example 2 Find  I = \displaystyle\int^1_0h(x)\,dx, h(x) = x^3

  1.  \displaystyle I = x^4/4 = 1/4
  2. \displaystyle  W(x) = x^3*(1-0)
  3.  Xi \sim~Unif(0,1)
  4. \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(x_i^3)

This example can be illustrated in Matlab using the code below:

x = rand (1000)

Example 3 To estimate an infinite integral such as  I = \displaystyle\int^\infty_5 h(x)\,dx, h(x) = 3e^{-x}

  1. Substitute in  y=\frac{1}{x-5+1} => dy=-\frac{1}{(x-4)^2}dx => dy=-y^2dx
  2.  I = \displaystyle\int^1_0 \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}\,dy
  3.  w(y) = \frac{3e^{-(\frac{1}{y}+4)}}{-y^2}(1-0)
  4.  Y_i \sim~Unif(0,1)
  5. \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^N(\frac{3e^{-(\frac{1}{y_i}+4)}}{-y_i^2})

Importance Sampling and Monte Carlo Simulation - May 28, 2009

During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and begun talking about Importance Sampling.

Binomial Probability Monte Carlo Simulations

Example 1:

You are given two independent Binomial distributions with probabilities \displaystyle p_1\text{, }p_2. Using a Monte Carlo simulation, approximate the value of \displaystyle \delta, where \displaystyle \delta = p_1 - p_2.

\displaystyle X \sim BIN(n, p_1); \displaystyle Y \sim BIN(m, p_2); \displaystyle \delta = p_1 - p_2

So \displaystyle f(p_1, p_2 | x,y) = \frac{f(x, y|p_1, p_2)f(p_1,p_2)}{f(x,y)} where \displaystyle f(x,y) is a flat distribution and the expected value of \displaystyle \delta is as follows:

\displaystyle E(\delta) = \int\int\delta f(p_1,p_2|X,Y)\,dp_1dp_2

Since X, Y are independent, we can split the conditional probability distribution:

\displaystyle f(p_1,p_2|X,Y) \propto f(p_1|X)f(p_2|Y)

We need to find conditional distribution functions for \displaystyle p_1, p_2 to draw samples from. In order to get a distribution for the probability 'p' of a Binomial, we have to divide the Binomial distribution by n. This new distribution has the same shape as the original, but is scaled. A Beta distribution is a suitable approximation. Let

\displaystyle f(p_1 | X) \sim \text{Beta}(x+1, n-x+1) and \displaystyle f(p_2 | Y) \sim \text{Beta}(y+1, m-y+1), where
\displaystyle \text{Beta}(\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}p^{\alpha-1}(1-p)^{\beta-1}


  1. Draw samples for \displaystyle p_1 and \displaystyle p_2: \displaystyle (p_1,p_2)^{(1)}, \displaystyle (p_1,p_2)^{(2)}, ..., \displaystyle (p_1,p_2)^{(n)};
  2. Compute \displaystyle \delta = p_1 - p_2 in order to get n values for \displaystyle \delta;
  3. \displaystyle \hat{\delta}=\frac{1}{N}\sum_{\forall i}\delta^{(i)}.

Matlab Code:

The Matlab code for recreating the above example is as follows:
n=100; %number of trials for X
m=100; %number of trials for Y
x=80; %number of successes for X trials
y=60; %number of successes for y trials
p1=betarnd(x+1, n-x+1, 1, 1000);
p2=betarnd(y+1, m-y+1, 1, 1000);

The mean in this example is given by 0.1938.

A 95% confidence interval for δ is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:


The interval is approximately  95%  CI \approx (0.06606, 0.32204)

The histogram of delta is:
Delta hist.jpg

Note: In this case, we can also find E(δ) analytically since E(\delta) = E(p_1 - p_2) = E(p_1) - E(p_2) = \frac{x+1}{n+2} - \frac{y+1}{m+2} \approx 0.1961 . Compare this with the maximum likelihood estimate for δ: \frac{x}{n} - \frac{y}{m} = 0.2.

Example 2:

Bayesian Inference for Dose Response

We conduct an experiment by giving rats one of ten possible doses of a drug, where each subsequent dose is more lethal than the previous one:

\displaystyle x_1<x_2<...<x_{10}

For each dose \displaystyle x_i we test n rats and observe \displaystyle Y_i, the number of rats that die. Therefore,

\displaystyle Y_i \sim~ BIN(n, p_i)

We can assume that the probability of death grows with the concentration of drug given, i.e. \displaystyle p_1<p_2<...<p_{10}. Estimate the dose at which the animals have at least 50% chance of dying.

Let \displaystyle \delta=x_j where \displaystyle j=min\{i|p_i\geq0.5\}
We are interested in \displaystyle \delta since any higher concentrations are known to have a higher death rate.

Solving this analytically is difficult:

\displaystyle \delta = g(p_1, p_2, ..., p_{10}) where g is an unknown function
\displaystyle \hat{\delta} = \int \int..\int_A \delta f(p_1,p_2,...,p_{10}|Y_1,Y_2,...,Y_{10})\,dp_1dp_2...dp_{10}
where \displaystyle A=\{(p_1,p_2,...,p_{10})|p_1\leq p_2\leq ...\leq p_{10} \}

Process: Monte Carlo
We assume that

  1. Draw \displaystyle p_i \sim~ BETA(y_i+1, n-y_i+1)
  2. Keep sample only if it satisfies \displaystyle p_1\leq p_2\leq ...\leq p_{10}, otherwise discard and try again.
  3. Compute \displaystyle \delta by finding the first \displaystyle p_i sample with over 50% deaths.
  4. Repeat process n times to get n estimates for \displaystyle \delta_1, \delta_2, ..., \delta_N .
  5. \displaystyle \bar{\delta} = \frac{\displaystyle\sum_{\forall i} \delta_i}{N}.

For instance, for each dose level Xi, for 1 < = i < = 10, 10 rats are used and it is observed that the numbers that are dying is Yi, where Y1 = 4,Y2 = 3,etc.

Importance Sampling

'A diagram of a Monte Carlo Approximation of g(x)[Importance Sampling, Wikipedia]

In statistics, Importance Sampling helps estimating the properties of a particular distribution. As in the case with the Acceptance/Rejection method, we choose a good distribution from which to simulate the given random variables. The main difference in importance sampling however, is that certain values of the input random variables in a simulation have more impact on the parameter being estimated than others. As shown in the figure, the uniform distribution U\sim~Unif[0,1] is a proposal distribution to sample from and g(x) is the target distribution.

Here we cast the integral \int_{0}^1 g(x)dx, as the expectation of g(x) with respect to U such that,

I = \int_{0}^1 g(x)= E_U(g(x)).

Hence we can approximate this integral by,

\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^{N} g(u_i).

[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL:]

In I = \displaystyle\int h(x)f(x)\,dx, Monte Carlo simulation can be used only if it easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability distribution function g(x) which is easy to sample from, then \ I can be written as

\begin{align}I &= \int h(x)f(x)\,dx = \int \frac{h(x)f(x)}{g(x)}g(x)\,dx \\
 &= E_g(w(x)) \end{align}

the expectation of \displaystyle w(x) = \frac{h(x)f(x)}{g(x)} with respect to g(x) and therefore \displaystyle x_1,x_2,...,x_N \sim~ g
 \hat{I}= \frac{1}{N}\sum_{i=1}^{N} w(x_i)


  1. Choose \displaystyle g(x) such that it's easy to sample from and draw x_i \sim~ g(x)
  2. Compute  w(x_i)=\frac{h(x_i)f(x_i)}{g(x_i)}
  3.  \hat{I} = \frac{1}{N}\sum_{i=1}^{N} w(x_i)

Note: By the law of large numbers, we can say that \hat{I} converges in probability to I.

"Weighted" average

The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from \displaystyle g(x) that are closer to the original distribution \displaystyle f(x), which we would ideally like to sample from (but cannot because it is too difficult).
\displaystyle I = \int\frac{h(x)f(x)}{g(x)}g(x)\,dx
=\displaystyle \int \frac{f(x)}{g(x)}h(x)g(x)\,dx

\displaystyle \int \frac{f(x)}{g(x)}E_g(h(x))\,dx which is like saying that we are applying a regular Monte Carlo Simulation method to \displaystyle\int h(x)g(x)\,dx , and taking each result from this process and weighting the more accurate ones (i.e. the ones for which \displaystyle \frac{f(x)}{g(x)} is high) higher.

One can view  \frac{f(x)}{g(x)}\ = B(x) as a weight.

Then \displaystyle \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^{N} w(x_i) = \frac{1}{N}\displaystyle\sum_{i=1}^{N} B(x_i)h(x_i)

i.e. we are computing a weighted sum of h(xi) instead of a sum

A Deeper Look into Importance Sampling - June 2, 2009

From last class, we have determined that an integral can be written in the form I = \displaystyle\int h(x)f(x)\,dx = \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx We continue our discussion of Importance Sampling here.

Importance Sampling

We can see that the integral \displaystyle\int \frac{h(x)f(x)}{g(x)}g(x)\,dx = \int \frac{f(x)}{g(x)}h(x) g(x)\,dx is just  \displaystyle E_g(h(x)) \rightarrowthe expectation of h(x) with respect to g(x), where \displaystyle \frac{f(x)}{g(x)} is a weight \displaystyle\beta(x). In the case where \displaystyle f > g, a greater weight for \displaystyle\beta(x) will be assigned. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.


The method of Importance Sampling is simple but can lead to some problems. The  \displaystyle \hat I estimated by Importance Sampling could have infinite standard error.

Given \displaystyle I= \int w(x) g(x) dx = \displaystyle E_g(w(x)) = \displaystyle \frac{1}{N}\sum_{i=1}^{N} w(x_i) where \displaystyle w(x)=\frac{f(x)h(x)}{g(x)} .

Obtaining the second moment,

\displaystyle E[(w(x))^2]
\displaystyle = \int (\frac{h(x)f(x)}{g(x)})^2 g(x) dx
\displaystyle = \int \frac{h^2(x) f^2(x)}{g^2(x)} g(x) dx
\displaystyle = \int \frac{h^2(x)f^2(x)}{g(x)} dx

We can see that if \displaystyle g(x) \rightarrow 0 , then \displaystyle E[(w(x))^2] \rightarrow \infty . This occurs if \displaystyle g has a thinner tail than \displaystyle f then \frac{h^2(x)f^2(x)}{g(x)} could be infinitely large. The general idea here is that \frac{f(x)}{g(x)} should not be large.

Remark 1

It is evident that \displaystyle g(x) should be chosen such that it has a thicker tail than \displaystyle f(x) . If \displaystyle f is large over set \displaystyle A but \displaystyle g is small, then \displaystyle \frac{f}{g} would be large and it would result in a large variance.

Remark 2

It is useful if we can choose \displaystyle g to be similar to \displaystyle f in terms of shape. Ideally, the optimal \displaystyle g should be similar to \displaystyle \left| h(x) \right|f(x), and have a thicker tail. It's important to take the absolute value of \displaystyle h(x), since a variance can't be negative. Analytically, we can show that the best \displaystyle g is the one that would result in a variance that is minimized.

Remark 3

Choose \displaystyle g such that it is similar to \displaystyle \left| h(x) \right| f(x) in terms of shape. That is, we want \displaystyle g \propto \displaystyle \left| h(x) \right| f(x)

Theorem (Minimum Variance Choice of \displaystyle g)

The choice of \displaystyle g that minimizes variance of \hat I is \displaystyle g^*(x)=\frac{\left| h(x) \right| f(x)}{\int \left| h(s) \right| f(s) ds}.


We know that \displaystyle w(x)=\frac{f(x)h(x)}{g(x)}

The variance of \displaystyle w(x) is

\displaystyle Var[w(x)]
\displaystyle = E[(w(x)^2)] - [E[w(x)]]^2
\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int \frac{f(x)h(x)}{g(x)}g(x)dx \right]^2
\displaystyle = \int \left(\frac{f(x)h(x)}{g(x)} \right)^2 g(x) dx - \left[\int f(x)h(x) \right]^2

As we can see, the second term does not depend on \displaystyle g(x) . Therefore to minimize \displaystyle Var[w(x)] we only need to minimize the first term. In doing so we will use Jensen's Inequality.

\displaystyle ======Aside: Jensen's Inequality======

If \displaystyle g is a convex function ( twice differentiable and \displaystyle g''(x) \geq 0 ) then \displaystyle g(\alpha x_1 + (1-\alpha)x_2) \leq \alpha g(x_1) + (1-\alpha) g(x_2)
Essentially the definition of convexity implies that the line segment between two points on a curve lies above the curve, which can then be generalized to higher dimensions:

\displaystyle g(\alpha_1 x_1 + \alpha_2 x_2 + ... + \alpha_n x_n) \leq \alpha_1 g(x_1) + \alpha_2 g(x_2) + ... + \alpha_n g(x_n) where \displaystyle \alpha_1 + \alpha_2 + ... + \alpha_n = 1
Proof (cont)

Using Jensen's Inequality,

\displaystyle g(E[x]) \leq E[g(x)] as \displaystyle g(E[x]) = g(p_1 x_1 + ... p_n x_n) \leq p_1 g(x_1) + ... + p_n g(x_n) = E[g(x)]


\displaystyle E[(w(x))^2] \geq (E[\left| w(x) \right|])^2
\displaystyle E[(w(x))^2] \geq \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2


\begin{align} \displaystyle \left(\int \left| \frac{f(x)h(x)}{g(x)} \right| g(x) dx \right)^2 &= \left(\int \frac{f(x)\left| h(x) \right|}{g(x)} g(x) dx \right)^2 \\ &= \left(\int \left| h(x) \right| f(x) dx \right)^2 \end{align}

since \displaystyle f and \displaystyle g are density functions, \displaystyle f, g are non negative.

Thus, this is a lower bound on \displaystyle E[(w(x))^2]. If we replace \displaystyle g^*(x) into \displaystyle E[g^*(x)], we can see that the result is as we require. Details omitted.

However, this is mostly of theoritical interest. In practice, it is impossible or very difficult to compute \displaystyle g^*.

Note: Jensen's inequality is actually unnecessary here. We just use it to get E[(w(x))^2] \geq (E[|w(x)|])^2, which could be derived using variance properties: 0 \leq Var[|w(x)|] = E[|w(x)|^2] - (E[|w(x)|])^2 = E[(w(x))^2] - (E[|w(x)|])^2.

Importance Sampling and Markov Chain Monte Carlo (MCMC) - June 4, 2009

Remark 4:

\begin{align} I &= \int^\ h(x)f(x)\,dx \\
         &= \int \ h(x)\frac{f(x)}{g(x)}g(x)\,dx \\
       \hat{I}&=\frac{1}{N} \sum_{i=1}^{N} h(x_i)b(x_i)\end{align}

where \displaystyle b(x_i) = \frac{f(x_i)}{g(x_i)}

I =\displaystyle \frac{\int\ h(x)f(x)\,dx}{\int f(x) dx}

Apply the idea of importance sampling to both the numerator and denominator:

=\displaystyle \frac{\int\ h(x)\frac{f(x)}{g(x)}g(x)\,dx}{\int\frac{f(x)}{g(x)}g(x) dx}
= \displaystyle\frac{\sum_{i=1}^{N} h(x_i)b(x_i)}{\sum_{1=1}^{N} b(x_i)}
= \displaystyle\sum_{i=1}^{N} h(x_i)b'(x_i) where \displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}

The above results in the following form of Importance Sampling:

 \hat{I} = \displaystyle\sum_{i=1}^{N} b'(x_i)h(x_i) where \displaystyle b'(x_i) = \frac{b(x_i)}{\sum_{i=1}^{N} b(x_i)}

This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.

Example of Importance Sampling

Estimate  I = \displaystyle\ Pr (Z>3) when Z \sim~ N(0,1)

\begin{align} Pr(Z>3) &= \int^\infty_3 f(x)\,dx \approx 0.0013 \\
   &= \int^\infty_{-\infty} h(x)f(x)\,dx \end{align}
h(x) = \begin{cases}
0, & \text{if } x < 3 \\
1, & \text{if } x \ge 3

Approach I: Monte Carlo

\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i) where X \sim~ N(0,1)

The idea here is to sample from normal distribution and to count number of observations that is greater than 3.

The variability will be high in this case if using Monte Carlo since this is considered a low probability event (a tail event), and different runs may give significantly different values. For example: the first run may give only 3 occurences (i.e if we generate 1000 samples, thus the probability will be .003), the second run may give 5 occurences (probability .005), etc.

This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):

format long
for i = 1:100
   a(i) = sum(randn(100,1)>=3)/100;
meanMC  = mean(a)
varMC   = var(a)

On running this, we get meanMC = 0.0005 and  varMC \approx 1.31313 * 10^{-5}

Approach II: Importance Sampling

\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\frac{f(x_i)}{g(x_i)} where f(x) is standard normal and g(x) needs to be chosen wisely so that it is similar to the target distribution.
Let g(x) \sim~ N(4,1)
b(x) = \frac{f(x)}{g(x)} = e^{(8-4x)}
\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nb(x_i)h(x_i)

This example can be illustrated in Matlab using the code below:

for j = 1:100
   N = 100;
   x = randn (N,1) + 4;
     for ii = 1:N
         h = x(ii)>=3;
         b = exp(8-4*x(ii));
         w(ii) = h*b;
  I(j) = sum(w)/N;
MEAN = mean(I)
VAR = var(I)

Running the above code gave us  MEAN \approx 0.001353 and  VAR \approx 9.666 * 10^{-8} which is very close to 0, and is much less than the variability observed when using Monte Carlo

Markov Chain Monte Carlo (MCMC)

Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain.

Consider the same integral:  I = \displaystyle\int^\ h(x)f(x)\,dx

Idea: If \displaystyle X_1, X_2,...X_N is a Markov Chain with stationary distribution f(x), then under some conditions

\hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\xrightarrow{P}\int^\ h(x)f(x)\,dx = I

Stochastic Process:
A Stochastic Process is a collection of random variables \displaystyle \{ X_t : t \in T \}

  • State Space Set:\displaystyle X is the set that random variables \displaystyle X_t takes values from.
  • Indexed Set:\displaystyle T is the set that t takes values from, which could be discrete or continuous in general, but we are only interested in discrete case in this course.

Example 1
i.i.d random variables

 \{ X_t : t \in T \}, X_t \in X
 X = \{0, 1, 2, 3, 4, 5, 6, 7, 8\} \rightarrowState Space
 T = \{1, 2, 3, 4, 5\} \rightarrowIndexed Set

Example 2

\displaystyle X_t: price of a stock
\displaystyle t: opening date of the market

Basic Fact: In general, if we have random variables \displaystyle X_1,...X_n

\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2,X_1)...f(X_n|X_n-1,...,X_1)
\displaystyle f(X_1,...X_n)= \prod_{i = 1}^n f(X_i|Past_i) where \displaystyle Past_i = (X_{i-1}, X_{i-2},...,X_1)

Markov Chain:
A Markov Chain is a special form of stochastic process in which \displaystyle X_t depends only on  \displaystyle X_{t-1}.

For example,

\displaystyle f(X_1,...X_n)= f(X_1)f(X_2|X_1)f(X_3|X_2)...f(X_n|X_{n-1})

Transition Probability:
The probability of going from one state to another state.

p_{ij} = \Pr(X_{n}=j\mid X_{n-1}= i). \,

Transition Matrix:
For n states, transition matrix P is an N \times N matrix with entries \displaystyle P_{ij} as below:


A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is \displaystyle Pr(heads) = p. And the probability of moving to the left is \displaystyle Pr(tails) = q = 1-p . Once the first or the last state is reached, then we stop. The transition matrix that express this process is shown as below:


Markov Chain Definitions - June 9, 2009

Practical application for estimation: The general concept for the application of this lies within having a set of generated xi which approach a distribution f(x) so that a variation of importance estimation can be used to estimate an integral in the form
 I = \displaystyle\int^\ h(x)f(x)\,dx by \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)
All that is required is a Markov chain which eventually converges to f(x).

In the previous example, the entries pij in the transition matrix P represent the probability of reaching state j from state i after one step. For this reason, the sum over all entries in a particular row sum to 1, as this itself must be a pmf if a transition from i is to lead to a state still within the state set for Xt.

Homogeneous Markov Chain
The probability matrix P is the same for all indicies n\in T. \displaystyle Pr(X_n=j|X_{n-1}=i)= Pr(X_1=j|X_0=i)

If we denote the pmf of Xn by a probability vector \frac{}{}\mu_n = [P(X_n=x_1),P(X_n=x_2),..,P(X_n=x_i)]
where i denotes an ordered index of all possible states of X.
Then we have a definition for the
marginal probabilty \frac{}{}\mu_n(i) = P(X_n=i)
where we simplify Xn to represent the ordered index of a state rather than the state itself.

From this definition it can be shown that, \frac{}{}\mu_{n-1}P=\mu_0P^n


\mu_{n-1}P=[\sum_{\forall i}(\mu_{n-1}(i))P_{i1},\sum_{\forall i}(\mu_{n-1}(i))P_{i2},..,\sum_{\forall i}(\mu_{n-1}(i))P_{ij}] and since

\sum_{\forall i}(\mu_{n-1}(i))P_{ij}=\sum_{\forall i}P(X_{n-1}=i)Pr(X_n=j|X_{n-1}=i)=\sum_{\forall i}P(X_{n-1}=i)\frac{Pr(X_n=j,X_{n-1}=i)}{P(X_{n-1}=i)} =\sum_{\forall i}Pr(X_n=j,X_{n-1}=i)=Pr(X_n=j)=\mu_{n}(j)


With this, it is possible to define P(n) as an n-step transition matrix where \frac{}{}P_{ij}(n) = Pr(X_n=j|X_0=i)

Theorem: \frac{}{}\mu_n=\mu_0P^n
Proof: \frac{}{}\mu_n=\mu_{n-1}P From the previous conclusion
\frac{}{}=\mu_{n-2}PP=...=\mu_0\prod_{i = 1}^nP And since this is a homogeneous Markov chain, P does not depend on i so

From this it becomes easy to define the n-step transition matrix as \frac{}{}P(n)=P^n

Summary of definitions

  • transition matrix is an NxN when N = | X | matrix with Pij = Pr(X1 = j | X0 = i) where i,j \in X
  • n-step transition matrix also NxN with Pij(n) = Pr(Xn = j | X0 = i)
  • marginal (probability of X)μn(i) = Pr(Xn = i)
  • Theorem: Pn = Pn
  • Theorem: μn = μ0Pn


Definitions of different types of state sets

Define i,j \in State Space
If Pij(n) > 0 for some n , then we say i reaches j denoted by i\longrightarrow j
This also mean j is accessible by i: j\longleftarrow i
If i\longrightarrow j and j\longrightarrow i then we say i and j communicate, denoted by i\longleftrightarrow j

1) i\longleftrightarrow i
2) i\longleftrightarrow j \Rightarrow j\longleftrightarrow i
3) If i\longleftrightarrow j,j\longleftrightarrow k\Rightarrow i\longleftrightarrow k
4) The set of states of X can be written as a unique disjoint union of subsets (equivalence classes) X=X_1\bigcup X_2\bigcup ...\bigcup X_k,k>0 where two states i and j communicate IFF they belong to the same subset

More Definitions
A set is Irreducible if all states communicate with each other (has only one equivalence class).
A subset of states is Closed if once you enter it, you can never leave.
A subset of states is Open if once you leave it, you can never return.
An Absorbing Set is a closed set with only 1 element (i.e. consists of a single state).


  • We cannot have \displaystyle i\longleftrightarrow j with i recurrent and j transient since \displaystyle i\longleftrightarrow j \Rightarrow j\longleftrightarrow i.
  • All states in an open class are transient.
  • A Markov Chain with a finite number of states must have at least 1 recurrent state.
  • A closed class with an infinite number of states has all transient or all recurrent states.

Again on Markov Chain - June 11, 2009

Decomposition of Markov chain

In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.

Let \displaystyle X = \{1, 2, 3, 4\} and the transition matrix be:


The decomposition in classes is:

class 1: \displaystyle \{1, 2\} \rightarrow From the matrix we see that the states 1 and 2 have only \displaystyle P_{12} and \displaystyle P_{21} as nonzero transition probability
class 2: \displaystyle \{3\} \rightarrow The state 3 can go to every other state but none of the others can go to it
class 3: \displaystyle \{4\} \rightarrow This is an absorbing state since it is a close class and there is only one element

Recurrent and Transient states

A state i is called \emph{recurrent} or \emph{persistent} if

\displaystyle Pr(x_{n}=i for some \displaystyle n\geq 1 | x_{0}=i)=1

That means that the probability to come back to the state i, starting from the state i, is 1.

If it is not the case (ie. probability less than 1), then state i is \emph{transient} .

It is straight forward to prove that a finite irreducible chain is recurrent.

Given a Markov chain,
A state \displaystyle i is \emph{recurrent} if and only if \displaystyle \sum_{\forall n}P_{ii}(n)=\infty
A state \displaystyle i is \emph{transient} if and only if \displaystyle \sum_{\forall n}P_{ii}(n)< \infty


  • If \displaystyle i is \emph{recurrent} and i\longleftrightarrow j then \displaystyle j is \emph{recurrent}
  • If \displaystyle i is \emph{transient} and i\longleftrightarrow j then \displaystyle j is \emph{transient}
  • In an equivalence class, either all states are recurrent or all states are transient
  • A finite Markov chain should have at least one recurrent state
  • The states of a finite, irreducible Markov chain are all recurrent (proved using the previous preposition and the fact that there is only one class in this kind of chain)

In the example above, state one and two are a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.

Let \displaystyle X=\{\cdots,-2,-1,0,1,2,\cdots\} and suppose that \displaystyle P_{i,i+1}=p , \displaystyle P_{i,i-1}=q=1-p and \displaystyle P_{i,j}=0 otherwise. This is the Random Walk that we have already seen in a previous lecture, except it extends infinitely in both directions.

We now see other properties of this particular Markov chain:

  • Since all states communicate if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the other will be transient too.
  • Consider now the case in which we are in state \displaystyle 0. If we move of n steps to the right or to the left, the only way to go back to \displaystyle 0 is to have n steps on the opposite direction.

\displaystyle Pr(x_{2n}=0|X_{0}=0)=P_{00}(2n)=[ {2n \choose n} ]p^{n}q^{n} We now want to know if this event is transient or recurrent or, equivalently, whether \displaystyle \sum_{\forall i}P_{ii}(n)\geq\infty or not.

To proceed with the analysis, we use the \emph{Stirling } \displaystyle\emph{formula}:

\displaystyle n!\sim~n^{n}\sqrt{n}e^{-n}\sqrt{2\pi}

The probability could therefore be approximated by:

\displaystyle P_{00}(n)=\sim~\frac{(4pq)^{n}}{\sqrt{n\pi}}

And the formula becomes:

\displaystyle \sum_{\forall n}P_{00}(n)=\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}}

We can conclude that if \displaystyle 4pq < 1 then the state is transient, otherwise is recurrent.

\sum_{\forall n}P_{00}(n)=\sum_{\forall n}\frac{(4pq)^{n}}{\sqrt{n\pi}} = \begin{cases}
\infty, & \text{if } p = \frac{1}{2} \\
< \infty, & \text{if } p\neq \frac{1}{2} 

An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:

\frac{1}{\sqrt{1 - 4x}} = \sum_{n=0}^{\infty} \binom{2n}{n} x^n

Then substitute in x = pq.

\frac{1}{\sqrt{1 - 4pq}} = \sum_{n=0}^{\infty} \binom{2n}{n} p^n q^n = \sum_{n=0}^{\infty} P_{00}(2n)

So we reach the same conclusion: all states are recurrent iff p = q = \frac{1}{2}.

Convergence of Markov chain

We define the \displaystyle \emph{Recurrence} \emph{time}\displaystyle T_{i,j} as the minimum time to go from the state i to the state j. It is also possible that the state j is not reachable from the state i.

\displaystyle T_{ij}=\begin{cases}
min\{n: x_{n}=i\}, & \text{if }\exists n \\
\infty, & \text{otherwise } 

The mean of the recurrent time \displaystyle m_{i}is defined as:

m_{i}=\displaystyle E(T_{ij})=\sum nf_{ii}

where \displaystyle f_{ij}=Pr(x_{1}\neq j,x_{2}\neq j,\cdots,x_{n-1}\neq j,x_{n}=j/x_{0}=i)

Using the objects we just introduced, we say that:

\displaystyle \text{state } i=\begin{cases}
\text{null}, & \text{if } m_{i}=\infty \\
\text{non-null or positive} , & \text{otherwise } 

In a finite state Markov Chain, all the recurrent state are positive

Periodic and aperiodic Markov chain

A Markov chain is called \emph{periodic} of period \displaystyle n if, starting from a state, we will return to it every \displaystyle n steps with probability \displaystyle 1 we can only return to that state in a multiple of n steps.

Considerate the three-state chain:


It's evident that, starting from the state 1, we will return to it on every 3rd step and so it works for the other two states. The chain is therefore periodic with perdiod d = 3

An irreducible Markov chain is called \emph{aperiodic} if:

\displaystyle Pr(x_{n}=j | x_{0}=i) > 0 \text{ and } Pr(x_{n+1}=j | x_{0}=i) > 0 \text{    for some } n\ge 0

Another Example
Consider the chain P=\left(\begin{matrix}

This chain is periodic by definition. You can only get back to state 1 after at least 2 steps \Rightarrow period d = 2

Markov Chains and their Stationary Distributions - June 16, 2009

New Definition:Ergodic

A state is Ergodic if it is non-null, recurrent, and aperiodic. A Markov Chain is ergodic if all its states are ergodic.

Define a vector π where \pi_i > 0 \forall i and

πi = 1

(ie. π is a pmf)

π is a stationary distribution if π = πP where P is a transition matrix.

Limiting Distribution

If as n \longrightarrow \infty , P^n \longrightarrow \left[ \begin{matrix}
\end{matrix}\right] then π is the limiting distribution of the Markov Chain represented by P.
Theorem: An irreducible, ergodic Markov Chain has a unique stationary distribution π and there exists a limiting distribution which is also π.

Detailed Balance

The condition for detailed balanced is \displaystyle \pi_i p_{ij} = p_{ji} \pi_j


If π satisfies detailed balance then it is a stationary distribution.

We need to show π = πP \displaystyle [\pi p]_j = \sum_{i} \pi_i p_{ij} = \sum_{i} p_{ji} \pi_j = \pi_j \sum_{i} p_{ji}= \pi_j as required

Warning! A chain that has a stationary distribution does not necessarily converge.

For example, P=\left(\begin{matrix}
\end{matrix}\right) has a stationary distribution \left(\begin{matrix}
\end{matrix}\right) but it will not converge.

Stationary Distribution

π is stationary (or invariant) distribution if π = π * p [0.5 0 0.5] Half of time their chain will spend half time in 1st state and half time in 3rd state.


An irreducible ergodic Markov Chain has a unique stationary distribution π. The limiting distribution exists and is equal to π.

If g is any bounded function, then with probability 1: lim \frac{1}{N}\displaystyle\sum_{i=1}^Ng(x_n)\longrightarrow E_n(g)=\displaystyle\sum_{j}g(j)\pi_j


Find the limiting distribution of P=\left(\begin{matrix}

Solve π = πP

\displaystyle \pi_0 = 1/2\pi_0 + 1/2\pi_1
\displaystyle \pi_1 = 1/2\pi_0 + 1/4\pi_1 + 1/3\pi_2
\displaystyle \pi_2 = 1/4\pi_1 + 2/3\pi_2

Also \displaystyle \sum_i \pi_i = 1 \longrightarrow \pi_0 + \pi_1 + \pi_2 = 1

We can solve the above system of equations and obtain
\displaystyle \pi_2 = 3/4\pi_1
\displaystyle \pi_0 = \pi_1

Thus, \displaystyle \pi_0 + \pi_1 + 3/4\pi_1 = 1 and we get \displaystyle \pi_1 = 4/11

Subbing \displaystyle \pi_1 = 4/11 back into the system of equations we obtain
\displaystyle \pi_0 = 4/11 and \displaystyle \pi_2 = 3/11

Therefore the limiting distribution is \displaystyle \pi = (4/11, 4/11, 3/11)

Monte Carlo using Markov Chain - June 18, 2009

Consider the problem of computing  I = \displaystyle\int^\ h(x)f(x)\,dx

\bullet Generate \displaystyle X_1, \displaystyle X_2,... from a Markov Chain with stationary distribution \displaystyle f(x)

\bullet \hat{I} = \frac{1}{N}\displaystyle\sum_{i=1}^Nh(x_i)\longrightarrow E_f(h(x))=\hat{I}

Metropolis Hastings Algorithm

The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.

Suppose we wish to sample from f(x), the 'target' distribution. Choose q(y | x), the 'proposal' distribution that is easily sampled from.


1. Initialize \displaystyle X_0, this is the starting point of the chain, choose it randomly and set index \displaystyle i=0

2. Y~ \sim~ q(y\mid{x})

3. Compute \displaystyle r(X_i,Y), where r(x,y)=min{\{\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})},1}\}

4.  U~ \sim~ Unif [0,1]

5. If \displaystyle U<r
then \displaystyle X_{i+1}=Y
else \displaystyle X_{i+1}=X_i

6. Update index \displaystyle i=i+1, and go to step 2

A couple of remarks about the algorithm

Remark 1: A good choice for q(y\mid{x}) is \displaystyle N(x,b^2) where \displaystyle b>0 is a constant. The starting point of the algorithm X0 = x, i.e. the proposal distibution is a normal centered at the current, randomly chosen, state.

Remark 2: If the proposal distribution is symmetric, q(y\mid{x})=q(x\mid{y}), then r(x,y)=min{\{\frac{f(y)}{f(x)},1}\}. This is called the Metropolis Algorithm, which is a special case of the original algorithm of Metropolis (1953).

Remark 3: \displaystyle N(x,b^2) is symmetric. Probability of setting mean to x and sampling y is equal to the probability of setting mean to y and samplig x.

Example: The Cauchy distribution has density  f(x)=\frac{1}{\pi}*\frac{1}{1+x^2}

Let the proposal distribution be q(y\mid{x})=N(x,b^2)


=min{\{\frac{f(y)}{f(x)},1}\} since q(y\mid{x}) is symmetric \Rightarrow \frac{q(x\mid{y})}{q(y\mid{x})}=1
=min{\{\frac{  \frac{1}{\pi}\frac{1}{1+y^2} }{  \frac{1}{\pi} \frac{1}{1+x^2} },1}\}
=min{\{\frac{1+x^2 }{1+y^2},1}\}

Now, having calculated \displaystyle r(x,y), we complete the problem in Matlab using the following code:

b=2; % let b=2 for now, we will see what happens when b is smaller or larger
for i=2:10000
   Y=b*randn+X(i-1); % we want to decide whether we accept this Y
   r=min( (1+X(i-1)^2)/(1+Y^2),1); 
   if u<r
       X(i)=Y; % accept Y
       X(i)=X(i-1); % reject Y remaining in the current state

We need to be careful about choosing b!

If b is too large
Then the fraction \frac{f(y)}{f(x)} would be very small \Rightarrow r=min{\{\frac{f(y)}{f(x)},1}\} is very small aswell.
It is highly unlikely that \displaystyle u<r, the probability of rejecting \displaystyle Y is high so the chain is likely to get stuck in the same state for a long time \rightarrow chain may not coverge to the right distribution.
It is easy to observe by looking at the histogram of \displaystyle X, the shape will not resemble the shape of the target \displaystyle f(x)
Most likely we reject y and the chain will get stuck.
If b is too small
Then we are setting up our proposal distribution q(y\mid{x}) to be much narrower then than the target \displaystyle f(x) so the chain will not have a chance to explore the sample state space and visit majority of the states of the target \displaystyle f(x).
If b is just right
Well chosen b will help avoid the issues mentioned above and we can say that the chain is "mixing well".

Mathematical explanation for why this algorithm works:

We talked about \emph{discrete} MC so far.

We have seen that:
- \displaystyle \pi satisfies detailed balance if \displaystyle \pi_iP_{ij}=P_{ji}\pi_j and
- if \displaystyle\pi satisfies \emph{detailed} \emph{balance}then it is a stationary distribution \displaystyle \pi=\pi P

In \emph{continuous}case we write the Detailed Balance as \displaystyle f(x)P(x,y)=P(y,x)f(y) and say that
\displaystyle f(x) is \emph{stationary} \emph{distribution} if f(x)=\int f(y)P(y,x)dy.

We want to show that if Detailed Balance holds (i.e. assume \displaystyle f(x)P(x,y)=P(y,x)f(y)) then \displaystyle f(x) is stationary distribution.

That is to show: \displaystyle f(x)P(x,y)=P(y,x)f(y)\Rightarrow \displaystyle f(x) is stationary distribution.

f(x)=\int f(y)P(y,x)dy
=\int f(x)P(x,y)dy
=f(x)\int P(x,y)dy and since \int P(x,y)dy=1
=\displaystyle f(x)

Now, we need to show that detailed balance holds in the Metropolis-Hastings...

Consider 2 points \displaystyle x and \displaystyle y:

Either \frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}>1 OR \frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})}<1 (ignoring that it might equal to 1)

Without loss of generality. suppose that the product is \displaystyle<1.

In this case r(x,y)=\frac{f(y)}{f(x)}*\frac{q(x\mid{y})}{q(y\mid{x})} and \displaystyle r(y,x)=1

Some intuitive meanings before we continue:
\displaystyle P(x,y) is jumping from \displaystyle x to \displaystyle y if proposal distribution generates \displaystyle y and \displaystyle y is accepted
\displaystyle P(y,x) is jumping from \displaystyle y to \displaystyle x if proposal distribution generates \displaystyle x and \displaystyle x is accepted
q(y\mid{x}) is the probability of generating \displaystyle y
q(x\mid{y}) is the probability of generating \displaystyle x
\displaystyle r(x,y) probability of accepting \displaystyle y
\displaystyle r(y,x) probability of accepting \displaystyle x.

With that in mind we can show that \displaystyle f(x)P(x,y)=P(y,x)f(y) as follows:

P(x,y)=q(y\mid{x})*(r(x,y))=q(y\mid{x})\frac{f(y)}{f(x)}\frac{q(x\mid{y})}{q(y\mid{x})} Cancelling out \displaystyle q(y\mid{x}) and bringing \displaystyle f(x) to the other side we get
f(x)P(x,y)=f(y)q(x\mid{y}) \Leftarrow equation \clubsuit

P(y,x)=q(x\mid{y})*(r(y,x))=q(x\mid{y})*1 Multiplying both sides by \displaystyle f(y) we get
f(y)P(y,x)=f(y)q(x\mid{y}) \Leftarrow equation \clubsuit\clubsuit

Noticing that the right hand sides of the equation \clubsuit and equation \clubsuit\clubsuit are equal we conclude that:

\displaystyle f(x)P(x,y)=P(y,x)f(y) as desired and thus showing that Metropolis-Hastings satisfies detailed balance.

Next lecture we will see that Metropolis-Hastings is also irreducible and ergodic thus showing that it converges.

Metropolis Hastings Algorithm Continued - June 25

Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to help sample from probability distributions that are difficult to sample from. The algorithm was named after Nicholas Metropolis (1915-1999), also co-author of the Simulated Anealing method (that is introduced in this lecture as well). The Gibbs sampling algorithm, that will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm. This is a more efficient method, although less applicable at times.

In the last class, we showed that Metropolis Hastings satisfied the the detail-balance equations. i.e.

\displaystyle f(x)P(x,y)=P(y,x)f(y), which means \displaystyle f(x) is the stationary distribution of the chain.

But this is not enough, we want the chain to converge to the stationary distribution as well.

Thus, we also need it to be:

Irreducible: There is a positive probability to reach any non-empty set of states from any starting point. This is trivial for many choice of \emph{q} including the one that we used in the example in the previous lecture (which was normally distributed)

Aperiodic: The chain will not oscillate between different set of states. In the previous example,  q(y\mid{x}) is  \displaystyle N(x,b^2), which will clearly not oscillate.

Next we discuss a couple of variations of Metropolis Hastings

Random Walk Metropolis Hastings


1. Draw \displaystyle Y = X_i + \epsilon, where \displaystyle \epsilon has distribution \displaystyle g ; \epsilon = Y-X_i \sim~ g ; \displaystyle X_i is current state & \displaystyle Y is going to be close to \displaystyle X_i

2. It means q(y\mid{x}) = g(y-x). (Note that \displaystyle g is a function of distance between the current state and the state the chain is going to travel to, i.e. it's of the form \displaystyle g(|y-x|) . Hence we know in this version that \displaystyle q is symmetric \Rightarrow q(y\mid{x}) = g(|y-x|) = g(|x-y|) = q(x\mid{y}))

3. r=min{\{\frac{f(y)}{f(x)},1}\}

Recall in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was  q(y\mid{x}) \sim~ N(x,b^2)

In matlab, we defined this as

\displaystyle Y = b* randn + x (i.e \displaystyle Y = X_i + randn*b)

In this case, we need \displaystyle \epsilon \sim~ N(0,b^2)

The hard problem is to choose b so that the chain will mix well.

Rule of thumb: choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)



If we draw \displaystyle y_1 then {\frac{f(y_1)}{f(x)}} > 1 \Rightarrow min{\{\frac{f(y_1)}{f(x)},1}\} = 1, accept \displaystyle y_1 with probability 1

If we draw \displaystyle y_2 then {\frac{f(y_2)}{f(x)}} < 1 \Rightarrow min{\{\frac{f(y_2)}{f(x)},1}\} = \frac{f(y_2)}{f(x)}, accept \displaystyle y_2 with probability \frac{f(y_2)}{f(x)}

Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with less density, then the chance that it will be accepted will be less than 1.

Independence Metropolis Hastings

In this case, the proposal distribution is independent of the current state, i.e. \displaystyle q(y\mid{x}) = q(y)

We draw from a fixed distribution

And define r = min{\{\frac{f(y)}{f(x)} \cdot \frac{q(x)}{q(y)},1}\}

And, this does not work unless \displaystyle q is very similar to the target distribution \displaystyle f (i.e. usually used when \displaystyle f is known up to a proportionality constant - the form of the distibution is known, but the distribution is not exactly known)

Now, we pose the question: if \displaystyle q(y\mid{x}) does not depend on \displaystyle X, does it mean that the sequence generated from this chain is really independent?

Answer: Even though  Y \sim~ g(y) does not depend on \displaystyle X , but \displaystyle r depends on \displaystyle X . So it's not really an independent sequence!

x_{i+1} = \begin{cases}
x_i \\
y \\

Thus, the sequence is not really independent because acceptance probability \displaystyle r depends on the state \displaystyle X_i

Simulated Annealing

Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.

Consider the problem where we want to find x such that the objective function h(x) is at it's minimum,

\ \min_{x}(h(x))

Given a constant T and since the exponential function is monotone, this optimization problem is equivalent to,

\ \max_{x}(e^{\frac{-h(x)}{T}})

We consider a distribution function \displaystyle f such that \displaystyle f \propto e^{\frac{-h(x)}{T}}, where T is called the temperature, and define the following procedure:

  1. Set T to be a large number
  2. Start with some random \displaystyle X_0, \displaystyle i = 0
  3.  Y \sim~ q(y\mid{x}) (note that \displaystyle  q is usually chosen to be symmetric)
  4.  U \sim~ Unif[0,1]
  5. Define r = min{\{\frac{f(y)}{f(x)},1}\} (when \displaystyle  q is symmetric)
  6. X_{i+1} = \begin{cases}
Y, & \text{with probability r} \\
X_i & \text{with probability 1-r}\\

    IE. Xi + 1 = Y if U < r

  7. Decrease T and go to Step 2

Now, we know that r = min{\{\frac{f(y)}{f(x)},1}\}

Consider  \frac{f(y)}{f(x)}

= \frac{e^{\frac{-h(y)}{T}}}{e^{\frac{-h(x)}{T}}}
= e^{\frac{h(x)-h(y)}{T}}

Now, suppose T is large,

\rightarrow if  h(y)<h(x) \Rightarrow r = 1 and we therefore accept \displaystyle y with probability 1

\rightarrow if  h(y)>h(x) \Rightarrow r = e^{\frac{h(x)-h(y)}{T}} < 1 and we therefore accept \displaystyle y with probability \displaystyle <1

On the other hand, suppose T is arbitrarily small ( T \rightarrow 0 ),

\rightarrow if  h(y)<h(x) \Rightarrow r = 1 and we therefore accept \displaystyle y with probability 1

\rightarrow if  h(y)>h(x) \Rightarrow r = 0 and we therefore reject \displaystyle y

Example 1

Consider the problem of minimizing the function \displaystyle f(x) = -2x^3 - x^2 + 40x + 3

We can plot this function and observe that it makes a local minimum near \displaystyle x = -3


We then plot the graphs of \displaystyle \frac{f(x)}{T} for \displaystyle T = 100, 0.1 and observe that the distribution expands for a large T, and contracts for T small - i.e. T plays the role of variance - making the distribution expand and contract accordingly.

In the end, T is small and the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).

In practice the algorithm may get 'stuck' in another local minimum nearby for T too small and we don't get the convergence we looking for.

Example 2 (from June 30th lecture)

Suppose we want to minimize the function \displaystyle f(x) = (x - 2)^2

Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure however, we require a proposal distribution. Suppose we use \displaystyle Y \sim~ N(x, b^2) and we begin with \displaystyle T = 10

Then the problem may be solved in MATLAB using the following:

function v = obj(x)
v = (x - 2).^2;
T = 10; %this is the initial value of T, which we must gradually decrease
b = 2;
X(1) = 0;
for i = 2:100 %as we change T, we will change i (e.g. i=101:200)
   Y = b*randn + X(i-1); 
   r = min(1 , exp(-obj(Y)/T)/exp(-obj(X(i-1))/T) );
   U = rand;
   if U < r
       X(i) = Y; %accept Y
       X(i) = X(i-1); %reject Y

The first run (with \displaystyle T = 10 ) gives us \displaystyle X = 1.2792

Next, if we let \displaystyle T = {9, 5, 2, 0.9, 0.1, 0.01, 0.005, 0.001} in the order displayed, then we get the following graph when we plot X:

SA Example2.jpg

i.e. it converges to the minimum of the function

Travelling Salesman Problem

This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).

The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.[2]

For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled Introduction to TSP - Infrastructure for the Travelling Salesman Problem. [2] The examples are particularly useful because they are implemented using R (a statistical computing software environment).

Gibbs Sampling - June 30, 2009

This algorithm is a specific form of Metropolis-Hastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).[3] In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies detailed balance equation, similar to Metropolis-Hastings

\,f(x) p_{xy} = f(y) p_{yx}

This implies that the chain is irreversible. The procedure of proving this balance equation is similar to what was done with Metropolis-Hasting proof.


  • The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.
  • It is useful for high-dimensional distributions.


  • We rarely know how to sample from the conditional distributions.
  • The algorithm can be extremely slow to converge.
  • It is often difficult to know when convergence has occurred.
  • The method is not practical when there are relatively small correlations between the random variables.

Example: Gibbs sampling is used if we want to sample from a multidimensional distribution - i.e. \displaystyle f(x_1, x_2, ... , x_n)

We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing


x_1^* \sim~ f(x_1|x_2, x_3, ... , x_n)

x_2^* \sim~ f(x_2|x_1^*, x_3, ... , x_n)


x_n^* \sim~ f(x_n|x_1^*, x_2^*, ... , x_{n-1}^*)

and the resulting set of points drawn \displaystyle (x_1^*, x_2^*, \ldots, x_n^*) follows the required multivariate distribution.

Suppose we want to sample from a bivariate distribution \displaystyle f(x,y) with initial point \displaystyle(x_i, y_i) = (0,0) , i = 0
Furthermore, suppose that we know how to sample from the conditional distributions \displaystyle f_{X|Y}(x|y) and \displaystyle f_{Y|X}(y|x)


  1. \displaystyle Y_{i+1} \sim~ f_{Y_i|X_i}(y|x) (i.e. given the previous point, sample a new point)
  2. \displaystyle X_{i+1} \sim~ f_{X_{i}|Y_{i+1}}(x|y) (note: it must be \displaystyle Y_{i+1} not Yi, otherwise detailed balance may not hold)
  3. Repeat Steps 1 and 2

Note This method have usually a long time before convergence called "burning time". For this reason the distribution will be sampled better using only some of the last \displaystyle X_i rather than all of them.

Example Suppose we want to generate samples from a bivariate normal distribution where \displaystyle \mu = \left[\begin{matrix} 1 \\ 2 \end{matrix}\right] and \sigma = \left[\begin{matrix} 1 & \rho \\ \rho & 1 \end{matrix}\right]

Note that for a bivariate distribution it may be shown that the conditional distributions are normal. So, \displaystyle f(x_2|x_1) \sim~ N(\mu_2 + \rho(x_1 - \mu_1), 1 - \rho^2) and \displaystyle f(x_1|x_2) \sim~ N(\mu_1 + \rho(x_2 - \mu_2), 1 - \rho^2)

The problem (for a specified value \displaystyle \rho) may be solved in MATLAB using the following:

Y = [1 ; 2];
rho = 0.01;
sigma = sqrt(1 - rho^2);
X(1,:) = [0 0];
for i = 2:5000
   mu = Y(1) + rho*(X(i-1,2) - Y(2));
   X(i,1) = mu + sigma*randn;
   mu = Y(2) + rho*(X(i-1,1) - Y(1));
   X(i,2) = mu + sigma*randn;
%plot(X(:,1),X(:,2),'.') plots all of the points
%plot(X(1000:end,1),X(1000:end,2),'.') plots the last 4000 points -> 
    this demonstrates that convergence occurs after a while 
    (this is called the burning time)

The output of plotting all points is:

Gibbs Sampling.jpg

Metropolis-Hastings within Gibbs Sampling - July 2

Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the Metropolis-Hastings algorithm.

  • Choose \displaystyle q as a proposal distribution for X (assuming Y fixed).
  • Choose \displaystyle \tilde{q} as a proposal distribution for Y (assuming X fixed).
  • Do a Metropolis-Hastings step for X, treating Y as fixed.
  • Do a Metropolis-Hastings step for Y, treating X as fixed.
  1. Start with some random variables \displaystyle X_n, Y_n, n = 0
  2. Draw Z~ \sim~ q(Z\mid{X_n})
  3. Set r = \min \{ 1, \frac{f(Z,Y_n)}{f(X_n,Y_n)}  \frac{q(X_n\mid{Z})}{q(Z\mid{X_n})} \}
  4. X_{n+1} = \begin{cases}
Z, & \text{with probability r}\\ 
X_n, & \text{with probability 1-r}\\
  5. Draw Z~ \sim~ \tilde{q}(Z\mid{Y_n})
  6. Set r = \min \{ 1, \frac{f(X_{n+1},Z)}{f(X_{n+1},Y_n)}  \frac{\tilde{q}(Y_n\mid{Z})}{\tilde{q}(Z\mid{Y_n})} \}
  7. Y_{n+1} = \begin{cases}
Z, & \text{with probability r} \\
Y_n, & \text{with probability 1-r}\\
  8. Set \displaystyle n = n + 1 , return to step 2 and repeat the same procedure

Page Ranking and Review of Linear Algebra - July 7

Page Ranking

Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the co-founders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.

In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page.

We rank pages based on the weighted number of links to that particular page. A web page is important if so many pages point to it.

Factors relating to importance of links

1) Importance (rank) of linking web page (higher importance is better).

2) Number of outgoing links from linking web page (lower is better - since the importance of the original page itself may be diminished if it has a large number of outgoing links).


L_{i,j} = \begin{cases}
1 , & \text{if j links to i}\\
0 , & \text{else}\\ \end{cases}

c_{j}=\sum_{i=1}^N L_{i,j}\text{ = number of outgoing links from website j}

P_{i} = (1-d)\times 1+ (d) \times \sum_{j=1}^n \frac{L_{i,j} \times P_j}{c_j} \text{ = rank of i} 
\text{           where } 0 \leq d \leq 1

Under this formula, \displaystyle P_i is never zero. We weight the sum and the constant using \displaystyle d (which is just a coefficient between 0 and 1 used to balance the objective function).

In Matrix Form

\displaystyle P =  (1-d)\times e + d \times L \times D^{-1} \times P

where P=\left(\begin{matrix}P_{1}\\
P_{2}\\ \vdots \\ P_{N} \end{matrix}\right) e=\left(\begin{matrix} 1\\
1\\ \vdots \\1  \end{matrix}\right)

are both \displaystyle N X \displaystyle 1 matrices


D=\left(\begin{matrix}c_{1}& 0 &\dots& 0 \\
0 & c_{2}&\dots&0\\
0&0&\dots&c_{N} \end{matrix}\right)

D^{-1}=\left(\begin{matrix}1/c_{1}& 0 &\dots& 0 \\
0 & 1/c_{2}&\dots&0\\
0&0&\dots&1/c_{N} \end{matrix}\right)

Solving for P

Since rank is a relative term, if we make an assumption that

\sum_{i=1}^N P_i = 1

then we can solve for P (in matrix form this is \displaystyle e^T \times P = 1

\displaystyle P =  (1-d)\times e \times 1 + d \times L \times D^{-1} \times P

\displaystyle P =  (1-d)\times e \times e^T \times P + d \times L \times D^{-1} \times P \text{     by replacing 1 with } e^T \times P

\displaystyle P =  [(1-d) \times e \times e^T + d \times L \times D^{-1}] \times P  \text{       by factoring out the } P

\displaystyle P =  A \times P  \text{       by defining A (notice that everything in A is known )}

We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.

If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.

To solve for P, we start with a random guess \displaystyle P_0 and repeatedly apply

\displaystyle P_i <= A \times P_i-1

Since this is a stationary series, for large n \displaystyle P_n = P.

Exemple of utilisation of Page Ranking :

Suppose that the links between four webpages are:

1 → 2

↕ ↙

3 ← 4

According to the factors relating to importance of links, we can consider two possible rankings :

\displaystyle  3 > 2 > 1 > 4


\displaystyle 3>1>2>4 if we consider that the high importance of the link from 3 to 1 is more influent than the fact that there are two outgoing links from page 1 and only one from page 2.

To rank our pages , we use the formula :

\displaystyle P =  [(1-d) \times\frac{ e \times e^T }{n}+ d \times L \times D^{-1}] \times P


L_{i,j} = \begin{cases}
1 , & \text{if j links to i}\\
0 , & \text{else}\\ \end{cases}

In this example,

1) L=\left(\begin{matrix}0&0&1&0\\

2) c_{j}=\sum_{i=1}^N L_{i,j} ,

So that here,

 c=\left[\begin{matrix} 2&1&1&1 \end{matrix}\right]

3) We have D=diag(C), so:


Algorithm on matlab:

First we define our matrix :

L=[0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0]

C=[2 1 1 1]




[vec val]=eig(A)

We are now going to choose a random vector p:




We now look at the probability of going on each page if we repeat many times a selection of a webpage





We finally get:

p =


So we conclude that   \displaystyle  3 > 1 > 2 > 4 as the probability of being on page three > Pr (page 1) > Pr (page 2) > Pr (page 4)

Linear Algebra Review

Inner Product[5]

Note that the inner product is also referred to as the dot product. If  \vec{u} = \left[\begin{matrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{matrix}\right] \text{  and     } \vec{v} = \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] then the inner product is :

 \vec{u} \cdot \vec{v} = \vec{u}^T\vec{v} = \left[\begin{matrix} u_1 & u_2 & \dots & u_n \end{matrix}\right] \left[\begin{matrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{matrix}\right] = u_1v_1 + u_2v_2 + u_3v_3 + \dots + u_nv_n

The length (or norm) of \displaystyle \vec{v} is the non-negative scalar \displaystyle||\vec{v}|| defined by \displaystyle ||\vec{v}|| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{v_1^2 + v_2^2 + \dots + v_n^2}

For \displaystyle \vec{u} and \displaystyle \vec{v} in \mathbf{R}^n , the distance between \displaystyle \vec{u} and \displaystyle \vec{v} written as  \displaystyle dist(\vec{u},\vec{v}) , is the length of the vector  \vec{u} - \vec{v}. That is,  \displaystyle dist(\vec{u},\vec{v}) = ||\vec{u} - \vec{v}||

If  \vec{u} and  \vec{v} are non-zero vectors in \mathbf{R}^2 or \mathbf{R}^3 , then the angle between  \vec{u} and  \vec{v} is given as \vec{u} \cdot \vec{v} = ||\vec{u}|| \ ||\vec{v}|| \ cos\theta

Orthogonal and Orthonormal[6]

Two vectors  \vec{u} and  \vec{v} in \mathbf{R}^n are orthogonal (to each other) if \vec{u} \cdot \vec{v} = 0

By the Pythagorean Theorem, it may also be said that two vectors  \vec{u} and  \vec{v} are orthogonal if and only if  ||\vec{u}+\vec{v}||^2 = ||\vec{u}||^2 + ||\vec{v}||^2

Two vectors  \vec{u} and  \vec{v} in \mathbf{R}^n are orthonormal if \vec{u} \cdot \vec{v} = 0 and ||\vec{u}||=||\vec{v}||=1

An orthonormal matrix \displaystyle U is a square invertible matrix, such that \displaystyle U^{-1} = U^T or alternatively \displaystyle U^T \ U = U \ U^T = I

Note that an orthogonal matrix is an orthonormal matrix.

Dependence and Independence[7]

The set of vectors  \{ \vec{v_1}, \dots , \vec{v_p} \} in \mathbf{R}^n is said to be linearly independent if the vector equation \displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0} has only the trivial solution (i.e. \displaystyle a_k = 0 \ \forall k ).

The set of vectors  \{ \vec{v_1}, \dots , \vec{v_p} \} in \mathbf{R}^n is said to be linearly dependent if there exists a set of coefficients  \{ a_1, \dots , a_p \} (not all zero), such that \displaystyle a_1\vec{v_1} + a_2\vec{v_2} + \dots + a_p\vec{v_p} = \vec{0}.

If a set contains more vectors than there are entries in each vector, then the set is linearly dependent.

That is, any vector set  \{ \vec{v_1}, \dots , \vec{v_p} \} in \mathbf{R}^n is linearly dependent if p > n.

If a vector set  \{ \vec{v_1}, \dots , \vec{v_p} \} in \mathbf{R}^n contains the zero vector, then the set is linearly dependent.

Trace and Rank[8]

The trace of a square matrix \displaystyle A_{nxn} , denoted by \displaystyle tr(A), is the sum of the diagonal entries in \displaystyle A . That is, \displaystyle tr(A) = \sum_{i = 1}^n a_{ii}

Note that an alternate definition for the trace is:

\displaystyle tr(A) = \sum_{i = 1}^n \lambda_{ii}

i.e. it is the sum of all the eigenvalues of the matrix

The rank of a matrix \displaystyle A , denoted by \displaystyle rank(A) , is the dimension of the column space of A. That is, the rank of a matrix is number of linearly independent rows (or columns) of A.

A square matrix is non-singular if and only if its rank equals the number of rows (or columns). Alternatively, a matrix is non-singular if it is invertible (i.e. its determinant is NOT zero). A matrix that is not invertible is sometimes called a singular matrix.

A matrix is said to be non-singular if and only if its rank equals the number of rows or columns. A non-singular matrix has a non-zero determinant.

A square matrix is said to be orthogonal if AAT = ATA = I.

For a square matrix A,

  • if  x^TAx > 0,\ \forall x \neq 0,then A is said to be positive-definite.
  • if  x^TAx \geq 0,\ \forall x \neq 0,then A is said to be positive-semidefinite.

The inverse of a square matrix A is denoted by A − 1 and is such that AA − 1 = A − 1A = I. The inverse of a matrix A exists if and only if A is non-singular.

The pseudo-inverse matrix A^{\dagger} is typically used whenever A − 1 does not exist because A is either not square or singular: A^{\dagger} = (A^TA)^{-1}A^T with A^{\dagger}A = I.

Vector Spaces

The n-dimensional space in which all the n-dimensional vectors reside is called a vector space.

A set of vectors {u1,u2,u3,...un} is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination of the {ui}: x = a1u1 + a2u2 + ... + anun

  • The coefficients {a1,a2,} are called the components of vector x with the basis {ui}.
  • In order to form a basis, it is necessary and sufficient that the {ui} vectors be linearly independent.

A basis {ui} is said to be orthogonal if u^T_i u_j\begin{cases}
\neq 0, & \text{ if }i=j\\
= 0, & \text{ if } i\neq j\\

A basis {ui} is said to be orthonormal if u^T_i u_j = \begin{cases}
1, & \text{ if }i=j\\
0, & \text{ if } i\neq j\\

Eigenvectors and Eigenvalues

Given matrix ANxN, we say that v is an 'eigenvector if there exists a scalar λ (the eigenvalue) such that Av = λv where λ is the corresponding eigenvalue.

Computation of eigenvalues Av = \lambda v \Rightarrow Av - \lambda v = 0 \Rightarrow (A-\lambda I)v = 0 \Rightarrow \begin{cases}
v = 0, & \text{trivial solution}\\
(A-\lambda v) = 0, & \text{non-trivial solution}\\
\end{cases} (A-\lambda v) = 0 \Rightarrow |A-\lambda v| = 0 \Rightarrow \lambda^N + a_1\lambda^{N-1} + a_2\lambda^{N-2} + ... + a_{N-1}\lambda + a_0 = 0 \leftarrow Characteristic Equation


  • If A is non-singular all eigenvalues are non-zero.
  • If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.
  • If A is positive-definite all eigenvalues are positive

Linear Transformations

A linear transformation is a mapping from a vector space XN onto a vector space YM, and it is represented by a matrix

  • Given vector x \in X^N, the corresponding vector y on YM is computed as y = Ax.
  • The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).

A linear transformation represented by a square matrix A is said to be orthonormal when AAT = ATA = I

  • implies that AT = A − 1
  • An orthonormal transformation has the property of preserving the magnitude of the vectors:

|y| = \sqrt{y^Ty}  = \sqrt{(Ax)^T Ax} = \sqrt{x^Tx} = |x|

  • An orthonormal matrix can be thought of as a rotation of the reference frame
  • The row vectors of an orthonormal transformation form a set of orthonormal basis vectors.

Interpretation of Eigenvalues and Eigenvectors

If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.

  • When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.

Given the covariance matrix \sum of a Gaussian distribution

  • The eigenvectors of \sum are the principal directions of the distribution
  • The eigenvalues are the variances of the corresponding principal directions

The linear transformation defined by the eigenvectors of \sum leads to vectors that are uncorrelated regardless of the form of the distribution (This is used in Principal Component Analysis).

  • If the distribution is Gaussian, then the transformed vectors will be statistically independent.

Principal Component Analysis - July 9

Principal Component Analysis (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.

Rough definition

Keepings two important aspects of data analysis in mind:

  • Reducing covariance in data
  • Preserving information stored in data(Variance is a source of information)

PCA takes a sample of d - dimensional vectors and produces an orthogonal(zero covariance) set of d 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second principal component is the direction of second greatest variance (orthogonal to the first component), etc.

Then we can preserve most of the variance in the sample in a lower dimension by choosing the first k Principle Components and approximating the data in k - dimensional space, which is easier to analyze and plot.

Principal Components of handwritten digits

Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset.

Threes dataset.png

We can represent each image as a vector of length 644 (644 = 23 \times 28) just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).

Matrix decomp PCA.png

Using PCA, we can approximate the data as the product of two smaller matrices, which I will call V \in M_{644,2} and W \in M_{2,103}. If we expand the matrix product then each image is approximated by a linear combination of the columns of V:  \hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 , where λ = [λ12]T is a column of W.

Linear comb PCA.png

Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.

PCA plot.png

The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.

More examples

The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see Nonlinear dimensionality reduction).

  • Handwritten digits.
  • Recognition of hand orientation. (Isomap??)
  • Recognition of facial expressions. (LLE - Locally Linear Embedding?)
  • Arranging words based on semantic meaning.
  • Detecting beards and glasses on faces. (MDS - Multidimensional scaling?)

Derivation of the first Principle Component

We want to find the direction of maximum variation. Let \boldsymbol{w} be an arbitrary direction, \boldsymbol{x} a data point and \displaystyle u the length of the projection of \boldsymbol{x} in direction \boldsymbol{w}.

\textbf{w} &= [w_1, \ldots, w_D]^T \\
\textbf{x} &= [x_1, \ldots, x_D]^T \\
u &= \frac{\textbf{w}^T \textbf{x}}{\sqrt{\textbf{w}^T\textbf{w}}}

The direction \textbf{w} is the same as c\textbf{w} so without loss of generality,

|\textbf{w}| &= \sqrt{\textbf{w}^T\textbf{w}} = 1 \\
u &= \textbf{w}^T \textbf{x}

Let x_1, \ldots, x_D be random variables, then our goal is to maximize the variance of u,

\textrm{var}(u) = \textrm{var}(\textbf{w}^T \textbf{x}) = \textbf{w}^T \Sigma \textbf{w},

For a finite data set we replace the covariance matrix Σ by s, the sample covariance matrix,

\textrm{var}(u) = \textbf{w}^T s\textbf{w}

is the variance of any vector \displaystyle u , formed by the weight vector \displaystyle w . The first principal component is the vector that maximizes the variance,

\textrm{PC} = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \operatorname{var}(u) \right) = \underset{\textbf{w}}{\operatorname{arg\,max}} \, \left( \textbf{w}^T s \textbf{w} \right)

where arg max denotes the value of w that maximizes the function. Our goal is to find the weights \displaystyle w that maximize this variability, subject to a constraint.
The problem then becomes,

\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right)  
such that \textbf{w}^T \textbf{w} = 1


\textbf{w}^T s \textbf{w} \leq \| \textbf{w}^T s \textbf{w} \| \leq \| s \| \| \textbf{w} \| =  \| s \|

Therefore the variance is bounded, so the maximum exists. We find the this maximum using the method of Lagrange multipliers.

Principal Component Analysis Continued - July 14

From the previous lecture, we have seen that to take the direction of maximum variance, the problem becomes: 
\underset{\textbf{w}}{\operatorname{max}} \, \left( \textbf{w}^T s \textbf{w} \right)  
with constraint \textbf{w}^T \textbf{w} = 1.

Before we can proceed, we must review Lagrange Multipliers.

Lagrange Multiplier

"The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is our solution." [Lagrange Multipliers, Wikipedia]

To find the maximum (or minimum) of a function \displaystyle f(x,y) subject to constraints \displaystyle g(x,y) = 0 , we define a new variable \displaystyle \lambda called a Lagrange Multiplier and we form the Lagrangian,

\displaystyle L(x,y,\lambda) = f(x,y) - \lambda g(x,y)

If \displaystyle (x^*,y^*) is the max of \displaystyle f(x,y), there exists \displaystyle \lambda^* such that \displaystyle (x^*,y^*,\lambda^*) is a stationary point of \displaystyle L (partial derivatives are 0).
In addition \displaystyle (x^*,y^*) is a point in which functions \displaystyle f and \displaystyle g touch but do not cross. At this point, the tangents of \displaystyle f and \displaystyle g are parallel or gradients of \displaystyle f and \displaystyle g are parallel, such that:

\displaystyle \nabla_{x,y } f = \lambda \nabla_{x,y } g

\displaystyle \nabla_{x,y} f = (\frac{\partial f}{\partial x},\frac{\partial f}{\partial{y}}) \leftarrow the gradient of \, f
\displaystyle \nabla_{x,y} g = (\frac{\partial g}{\partial{x}},\frac{\partial{g}}{\partial{y}}) \leftarrow the gradient of \, g


Suppose we wish to maximise the function \displaystyle f(x,y)=x-y subject to the constraint \displaystyle x^{2}+y^{2}=1. We can apply the Lagrange multiplier method on this example; the lagrangian is:

\displaystyle L(x,y,\lambda) = x-y - \lambda (x^{2}+y^{2}-1)

We want the partial derivatives equal to zero:

\displaystyle \frac{\partial L}{\partial x}=1+2 \lambda x=0

\displaystyle \frac{\partial L}{\partial y}=-1+2\lambda y=0

\displaystyle \frac{\partial L}{\partial \lambda}=x^2+y^2-1

Solving the system we obtain 2 stationary points: \displaystyle (\sqrt{2}/2,-\sqrt{2}/2) and \displaystyle (-\sqrt{2}/2,\sqrt{2}/2). In order to understand which one is the maximum, we just need to substitute it in \displaystyle f(x,y) and see which one as the biggest value. In this case the maximum is \displaystyle (\sqrt{2}/2,-\sqrt{2}/2).

Determining W

Back to the original problem, from the Lagrangian we obtain,

\displaystyle L(\textbf{w},\lambda) = \textbf{w}^T S \textbf{w} - \lambda (\textbf{w}^T \textbf{w} - 1)

If  \textbf{w}^T \textbf{w} is a unit vector then the second part of the equation is 0.

If  \textbf{w}^T  \textbf{w} is not a unit vector the the second part of the equation increases. Thus decreasing overall \displaystyle L(\textbf{w},\lambda). Maximization happens when  \textbf{w}^T \textbf{w} =1

(Note that to take the derivative with respect to w below,  \textbf{w}^T S \textbf{w} can be thought of as a quadratic function of w, hence the 2sw below. For more matrix derivatives, see section 2 of the Matrix Cookbook)

Taking the derivative with respect to w, we get:

\displaystyle \frac{\partial L}{\partial \textbf{w}} = 2S\textbf{w} - 2\lambda\textbf{w}

Set  \displaystyle \frac{\partial L}{\partial \textbf{w}} = 0 , we get

\displaystyle S\textbf{w}^* = \lambda^*\textbf{w}^*

From the eigenvalue equation \, \textbf{w}^* is an eigenvector of S and \, \lambda^* is the corresponding eigenvalue of S. If we substitute \displaystyle\textbf{w}^* in \displaystyle \textbf{w}^T S\textbf{w} we obtain,

\displaystyle\textbf{w}^{*T} S\textbf{w}^* = \textbf{w}^{*T} \lambda^* \textbf{w}^* = \lambda^* \textbf{w}^{*T} \textbf{w}^* = \lambda^*

In order to maximize the objective function we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, u1 to have the maximum variance
(i.e. capturing as much variability in in \displaystyle x_1, x_2,...,x_D as possible.) Subsequent principal components will take up successively smaller parts of the total variability.

D dimensional data will have D eigenvectors

\lambda_1 \geq \lambda_2 \geq ... \geq \lambda_D where each \, \lambda_i represents the amount of variation in direction \, i

so that

Var(u_1) \geq Var(u_2) \geq ... \geq Var(u_D)

Note that the Principal Components decompose the total variance in the data:

\displaystyle \sum_{i = 1}^D Var(u_i) = \sum_{i = 1}^D \lambda_i = Tr(S) = \sum_{i = 1}^D Var(x_i)

i.e. the sum of variations in all directions is the variation in the whole data

Example from class

We apply PCA to the noise data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors and plot the original and reconstructed images

The Matlab code is as follows:

 load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')
 colormap gray
 [u s v] = svd(X);
 xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)'; % use ten principal components
 imagesc(reshape(xHat(:,1000),20,28)') % here '1000' can be changed to different values, e.g. 105, 500, etc.
 colormap gray

Running the above code gives us 2 images - the first one represents the noisy data - we can barely make out the face

The second one is the denoised image

As you can clearly see, more features can be distinguished from the picture of the de-noised face compared to the picture of the noisy face. If we took more principal components, at first the image would improve since the intrinsic dimensionality is probably more than 10. But if you include all the components you get the noisy image, so not all of the principal components improve the image. In general, it is difficult to choose the optimal number of components.

Principle Component Analysis (continued) - July 16

Application of PCA - Feature Abstraction

One of the applications of PCA is to group similar data (e.g. images). There are generally two methods to do this. We can classify the data (i.e. give each data a label and compare different types of data) or cluster (i.e. do not label the data and compare output for classes).

Generally speaking, we can do this with the entire data set (if we have an 8X8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.

Example: Comparing Images of 2s and 3s

To demonstrate this process, we can compare the images of 2s and 3s - from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example in classifying.


The matlab code is as follows.

      load 2_3   %the size of this file is 64 X 400
      [coefs , scores ] = princomp (X')  
             % performs principal components analysis on the data matrix X
             % returns the principal component coefficients and scores
             % scores is the low dimensional representatioation of the data X
             % plots the first most variant dimension on the x-axis 
             % and the second highest on the y-axis  
             % same graph as above, only with the 2s (not 3s)
      hold on   % this command allows us to add to the current plot
      plot (scores(201:400,1),scores(201:400,2),'ro')
             % this addes the data for the 3s
             % the 'ro' command makes them red Os on the plot
      % If We classify based on the position in this plot (feature), 
      % its easier than looking at each of the 64 data pieces
      gname() % displays a figure window and 
              % waits for you to press a mouse button or a keyboard key

General PCA Algorithm

The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).


Other Notes:

  1. The mean of the data(X) must be 0. This means we may have to preprocess the data by subtracting off the mean(see detailsPCA in Wikipedia.)
  2. Encoding the data means that we are projecting the data onto a lower dimensional subspace by taking the inner product. Encoding: X_{D\times n} \longrightarrow Y_{d\times n} using mapping \, U^T X_{d \times n} .
  3. When we reconstruct the training set, we are only using the top d dimensions.This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing:  \hat{X}_{D\times n}\longleftarrow Y_{d \times n} using mapping \, UY_{D \times n} .
  4. We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.

Fisher's Linear Discriminant Analysis (FDA) - July 16(cont)

Similar to PCA, the goal of FDA is to project the data in a lower dimension. The difference is that we we are not interested in maximizing variances. Rather our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for direction representative of a particular characteristic e.g. glasses vs. no-glasses).

The number of dimensions that we want to reduce the data to, depends on the number of classes:
For a 2 class problem, we want to reduce the data to one dimension (a line), \displaystyle Z \in \mathbb{R}^{1}
Generally, for a k class problem, we want k-1 dimensions, \displaystyle Z \in \mathbb{R}^{k-1}

As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).

The following diagram summarizes this goal.


In fact, the two examples above may represent the same data projected on two different lines.


Goal: Maximum Separation

1. Minimize the within class variance

\displaystyle \min (w^T\sum_1w)

\displaystyle \min (w^T\sum_2w)

and this problem reduces to \displaystyle \min (w^T(\sum_1 + \sum_2)w)
(where \displaystyle \sum_1 and \sum_2 are the covariance matrices for the 1st and 2nd class of data respectively)

Let \displaystyle \ s_w=\sum_1 + \sum_2 be the within classes covariance. Then, this problem can be rewritten as \displaystyle \min (w^Ts_ww)

2. Maximize the distance between the means of the projected data

The optimization problem we want to solve is,

\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2,

\begin{align} (w^T \mu_1 - w^T \mu_2)^2 &= (w^T \mu_1 - w^T \mu_2)^T(w^T \mu_1 - w^T \mu_2)\\
 &= (\mu_1^Tw - \mu_2^Tw^T)(w^T \mu_1 - w^T \mu_2)\\
 &= (\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2) \end{align}

which is a scalar. Therefore,

\displaystyle = tr[(\mu_1^T - \mu_2^T)ww^T(\mu_1 - \mu_2)]

\displaystyle = tr[w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw]

(using the property of \displaystyle tr[ABC] = tr[CAB] = tr[BCA]

\displaystyle = w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw

Thus our original problem equivalent can be written as,

\displaystyle \max (w^T \mu_1 - w^T \mu_2)^2 = \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)

For a two class problem the between class variance is,

\displaystyle \ s_B=(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T

Then this problem can be rewritten as,

\displaystyle \max (w^Ts_Bw)

Objective Function

We want an objective function which satisifies both of the goals outlined above (at the same time).

  1. \displaystyle \min (w^T(\sum_1 + \sum_2)w) or \displaystyle \min (w^Ts_ww)
  2. \displaystyle \max (w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw) or \displaystyle \max (w^Ts_Bw)

We take the ratio of the two -- we wish to maximize

\displaystyle \frac{(w^T(\mu_1 - \mu_2)(\mu_1 - \mu_2)^Tw)} {(w^T(\sum_1 + \sum_2)w)}

or equivalently,

\displaystyle \max \frac{(w^Ts_Bw)}{(w^Ts_ww)}

This is a very famous problem which is called "the generalized eigenvector problem". We can solve this using Lagrange Multipliers. Since W is a directional vector, we do not care about the size of W. Therefore we solve a problem similar to that in PCA,

\displaystyle \max (w^Ts_Bw)
subject to \displaystyle (w^Ts_Ww=1)

We solve the following Lagrange Multiplier problem,

\displaystyle L(w,\lambda) = w^Ts_Bw - \lambda (w^Ts_Ww -1)

Continuation of Fisher's Linear Discriminant Analysis (FDA) - July 21

As discussed in the previous lecture, our Optimization problem for FDA is:

\max_{w} \frac{w^T s_B w}{w^T s_w w}

which we turned into

\displaystyle \max (w^Ts_Bw)
Subject to: \displaystyle (w^Ts_ww=1)

where sB is the covariance matrix between classes and sw is the covariance matrix within classes.

Using Lagrange multipliers, we have a Partial solution to: \displaystyle (w^Ts_Bw) - \lambda \cdot [(w^Ts_ww)-1]

- The optimal solution for w is the eigenvector of \displaystyle s_w^{-1}s_B corresponding to the largest eigenvalue;

- For a k class problem, we will take the eigenvectors corresponding to the (k-1) highest eigenvalues.

- In the case of two-class problem, the optimal solution for w can be simplfied, such that: \displaystyle w \propto s_w^{-1}(\mu_2 - \mu_1)


We see now an application of the theory that we just introduced. Using Matlab, we find the principal components and the projection by Fisher Discriminant Analysis of two Bivariate normal distributions to show the difference between the two methods.

      %First of all, we generate the two data set:
      X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300) 
      %In this case: mu_1=[1;1]; Sigma_1=[1 1.5; 1.5 3], where mu and sigma are the mean and covariance matrix.
      X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300) 
      %Here mu_2=[5;3] and Sigma_2=[1 1.5; 1.5 3]
      %The plot of the two distributions is:


      %We compute the principal components:
      [coefs, scores]=princomp(X');
      coefs(:,1)    %first principal component
      ans =
      plot([0 coefs(1,1)], [0 coefs(2,1)],'b')
      plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')
      sw=2*[1 1.5;1.5 3]   % sw=Sigma1+Sigma2=2*Sigma1
      w=sw\[4; 2]       % calculate s_w^{-1}(mu2 - mu1)
      plot ([0 w(1)], [0 w(2)],'g')

Pca full 1.jpg

      %We now make the projection:
      plot(Xf(1:300),1,'ob') %In this case, since it's a one dimension data, the plot is "Data Vs Indexes"
      hold on

Fisher no overlap.jpg

      %We see that in the above picture that there is no overlapping
      hold on

Pca overlap.jpg

      %In this case there is an overlapping since we project the first principal component on [Xp=coefs(:,1)'*X]

Classification - July 21 (cont)

The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance we could be wishing to classify an image as a chair, a desk, or a person. The discrete random variable, Y, is drawn from the set 'chair,' 'desk,' and 'person,' while the random variable X is the image. (In the case in which Y is a continuous variable, classification is an application of regression.)

Consider independent and identically distributed data points  \displaystyle (x_1,y_1),...,(x_N,y_N) where  \displaystyle x_i \in X \subset \mathbb{R}^d and  y_i \in Y and Y is a finite set of discrete values. The set of data points  \displaystyle \{(x_1,y_1),...,(x_N,y_N)\} is called the training set.

Then \displaystyle h(x) is a classifier where, given a new data point  \displaystyle x,  \displaystyle h(x) predicts  \displaystyle y. The function  \displaystyle h is found using the training set. i.e. the set trains  \displaystyle h to map  \displaystyle X to  \displaystyle Y:

\, y = h(x)

\, h: X \to Y

To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.

Error Rate

\, \hat E(h) =Pr(h(x)\neq y)

Given test points, how can we find the emperical error rate?

We simply count the number of points that have been misclassified and divide by the total number of points.

\, \hat E(h) = \frac{1}{N} \sum_{i=1}^N I (Y_i \neq h(x_i))

Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.

Bayes Classification Rule

Considering the case of a two-class problem where  \mathcal{Y} = \{0,1\}

\, r(x)= Pr(Y=1 \mid X=x)= \frac {Pr(X=x \mid Y=1)P(Y=1)} {Pr(X=x)}

Where the denominator can be written as \, Pr(X=x) = Pr(X=x \mid Y=1)P(Y=1)+Pr(X=x \mid Y=0)P(Y=0)

So our classifier function h(x) = \begin{cases}
1 & r(x) \geq \frac{1}{2}  \\
0 & o/w\\

This function is considered to be the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions to calculate \, r(x). A real-life interpretation of the marginal\, Pr(Y=1 \mid X=x) may even deal with patterns and meaning, which provides an extra challenge in finding a mathematical interpretation.

Set Differentiation.jpg

In the image, which set would it be more appropriate for the question mark to belong to?

This function is viewed as a theoretical bound - the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.

Decision Boundary

Decision boundary Joanna.jpg

The Decision boundary is given by:

\, Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)

Suggesting those points where the probabilities of being in both classes are identical. Thus,

\, D:  \{ x \mid Pr(Y=1 \mid X=x)= Pr(Y=0 \mid X=x)\}

Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.

Linear Discriminant Analysis(LDA) - July 23


"LDA decision boundary for 2 classes of multivariate normal data"

We would like to apply Bayes Classifiaction rule by approximating the class conditional density \, f_k(x) and the prior \, \pi_k P(Y=k|X=x) = \frac{f_k(x)\pi_k}{\sum_{\forall{k}}f_k(x_k)\pi_k}

By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,

  1. The class conditional density is multivariate gaussian
  2. The classes have a common covariance matrix


Note on Quadratic Form

\, (x + a)^TA(x+b) = x^TAx + a^TAb + x^TAb + a^TAx

By assumption (1),

f_k(x) = \frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}

where \, \Sigma_k is the class covariance matrix and \, \mu_k is the class mean. By definition of the decision boundary (decision boundary between class \ k and class \ l ),

\begin{align}&P(Y=k|X=x) = P(Y=l|X=x)\\ \\ &\frac{1}{(2\pi)^{d/2}|\Sigma_k|^{1/2}}e^{-\frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)}\pi_k = \frac{1}{(2\pi)^{d/2}|\Sigma_l|^{1/2}}e^{-\frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l)}\pi_l\end{align}

By assumption (2),

\begin{align}& \Sigma_k = \Sigma_l = \Sigma\\ \\ & e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)}\pi_k = e^{-\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l)} \pi_l \\ & -\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k) + \log(\pi_k) = -\frac{1}{2}(x-\mu_l)^T\Sigma^{-1}(x-\mu_l) + \log(\pi_l) \\ & -\frac{1}{2}(x^T\Sigma^{-1}x + \mu_k^T\Sigma^{-1}\mu_k - 2x^T\Sigma^{-1}\mu_k) + \frac{1}{2}(x^T\Sigma^{-1}x + \mu_l^T\Sigma^{-1}\mu_l - 2x^T\Sigma^{-1}\mu_l) + \log{\frac{\pi_k}{\pi_l}} = 0\ \\ \\ & x^T(\mu_k - \mu_l) + \frac{1}{2}(\mu_l - \mu_k)^T\Sigma^{-1}(\mu_l + \mu_k) + \log{\frac{\pi_k}{\pi_l}} = 0 \end{align}

The result is a linear function of \ x of the form \, x^Ta + b = 0 .

      %Load data set
      load 2_3;
      [coefs, scores] = princomp(X');
        % ans = 64 400          % 64 principal components
        % ans = 64 64 
        % ans = 400 64
      Y=scores(:, 1:2);
        % just use two of the 64 principal components
      plot(Y(1:200, 1),Y(1:200, 2), 'b.')
      hold on
      plot(Y(201:400, 1),Y(201:400, 2), 'r.')

Pca 2.jpg

      [C,err,P,logp] = classify(Y, Y, ll', 'linear');
      %The output of the function classify are:
      %  C, that is a vector whose values define the groups that are classified from the rows of the matrix
      %  err, that is an estimate of the misclassification error
      %  P, that gives the probabilities that the observation n belongs to the ith class 
      %  logp, that gives the logaritm of the probability density of the nth observation. Where the latter is the sum of 
      %        the conditional probability of the nth observation to belong to the class ith, times the probability of the 
      %        class ith, taken over all classes.

Computational Method

We can implement this computationally by the following:

Define two variables, \, \delta_k and \, \delta_l

 \,\delta_k  = log(f_k(x)\pi_k) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_k|)

 \,\delta_l  = log(f_l(x)\pi_l) = log (\pi_l) - \frac{1}{2}(x-\mu_l)^T\Sigma_l^{-1}(x-\mu_l) - \frac{d}{2}log(2\pi) - \frac{1}{2}log(|\Sigma_l|)

To classify a point, \ x , first compute \, \delta_k and \, \delta_l .

Classify it to class \ k if \, \delta_k > \delta_l and vise versa.

h(x) = \begin{cases}
k, & \text{if } \delta_k > \delta_l \\
l, &  otherwise\\

(note: since  - \frac{d}{2}log(2\pi) is a constant term, we can simply ignore it in the actual computation since it will cancel out when we do the comparison of the deltas.)

Special Case: \, \Sigma_k = I , the identity matrix. Then,

 \,\delta_k  = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k) - \frac{1}{2}log(|I|) = log (\pi_k) - \frac{1}{2}(x-\mu_k)^T(x-\mu_k)

We see that in the case \, \Sigma = I , we can simply classify a point, \ x , to a class based on the distances between \ x and the mean of the different classes (adjusted with the log of the prior).

General Case: \, \Sigma_k \ne I

 \, \Sigma_k = USV^T = USU^T (since \ \Sigma is symmetric)

 \, \Sigma_k^{-1} = (USU^T)^{-1} = (U^T)^{-1}S^{-1}U^{-1} = US^{-1}U^T

So,  (x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)

 \, = (x-\mu_k)^T US^{-1}U^T(x-\mu_k)
 \, = (U^Tx-U^T\mu_k)^T S^{-1}(U^Tx-U^T\mu_k)
 \, = (U^Tx-U^T\mu_k)^T S^{-\frac{1}{2}}S^{-\frac{1}{2}}(U^Tx-U^T\mu_k)
 \, = (S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)^T I(S^{-\frac{1}{2}}U^Tx-S^{-\frac{1}{2}}U^T\mu_k)
 \, = (x^* - \mu_k^*)^T I(x^* - \mu_k^*)

where  \, x^* = S^{-\frac{1}{2}}U^Tx and  \mu_k^* = S^{-\frac{1}{2}}U^T\mu_k

Hence the approach taken should be to transpose point x from the beginning,

i.e. Let  \, x^* = S^{-\frac{1}{2}}U^Tx

Then compute  \, \delta_k and  \, \delta_l with x * , similar to the special case above.

If the prior distributions of the 2 classes are the same, then this method only requires us to find the distances from the point x to the mean of the 2 classes. We would classify x based on the shortest distance to the mean.

In the δ function calculations above, πk = P(X = k) and πl = P(X = l), and can be approximated using the proportions of k and l elements in the training set.

More on Quadratic Discriminant Analysis - July 28

The curse of dimensionality is a problem in many fields, including computer science and statistics. Loosely stated, it says that many problems in computer science and statistics become impractical as the number of dimensions increases.

For example, the number of points needed to accurately estimate parameters increases with the number of parameters. For linear discriminant analysis we have to estimate \ d(d+1)/2 elements of the covariance matrix and \ 2d elements for the mean, for a total of \ (d^2 + 5d)/2. For quadratic discriminant analysis, we have to estimate a second covariance matrix, so we estimate \ d^2 + 3d parameters in total. Clearly we have to estimate more parameters for QDA than LDA, so we need more points to get an accurate estimate for the discriminant.


\ Y_{N\times 1} = X_{N\times d}^{-1}W_{d\times 1}

We can find very flexible boundaries with LDA by expanding the input domain. For example, suppose \ X has two features and can be represented as \ X=[x, y] then,

We can expand \ X to \ [x, y, x^2, y^2, xy] and get a quadratic boundary.
We can expand \ X to \ [x, y, xy, x^2, y^2, x^3, y^3, x^2y, xy^2] and get a cubic boundary.

In Matlab:

      %To double the input:
      load 2_3;
      [U, Y] = princomp(X');
      Y1=[Y Y.^2];
      %Now it classify using not only y but also y^2.
Personal tools