Stat341 / CM 361
From Wiki Course Notes
Computational Statistics and Data Analysis is a course offered at the University of Waterloo
Spring 2009
Instructor: Ali Ghodsi
Sampling (Generating random numbers)
Generating Random Numbers - May 12, 2009
Generating random numbers in a computational setting presents a challenge: computers are deterministic machines, so they cannot produce truly random output. Instead, computational statistics relies on deterministic algorithms whose output is designed so that the probability distribution of each possible number appears to be uniform (pseudorandom). Outside a computational setting, producing a uniform distribution is fairly easy (for example, rolling a fair die repeatedly produces a series of random numbers from 1 to 6).
We begin by considering the simplest case: the uniform distribution.
Multiplicative Congruential Method
One way to generate pseudorandom numbers from the uniform distribution is the Multiplicative Congruential Method. This involves three integer parameters a, b, and m, and a seed variable x_{0}. The method deterministically generates a sequence of numbers (based on the seed) with a seemingly random distribution (with some caveats). It proceeds by the recurrence:
x_{k+1} = (a·x_{k} + b) mod m
For example, with a = 13, b = 0, m = 31, x_{0} = 1, we have:
x_{1} = (13 × 1) mod 31 = 13
x_{2} = (13 × 13) mod 31 = 169 mod 31 = 14
x_{3} = (13 × 14) mod 31 = 182 mod 31 = 27
and so on; this sequence cycles through all of 1, ..., 30 before returning to 1.
The above generator of pseudorandom numbers is called a Mixed Congruential Generator or Linear Congruential Generator, as it involves both an additive and a multiplicative term. For correctly chosen values of a, b, and m, this method will generate a sequence of integers including all integers between 0 and m - 1. Scaling the output by dividing the terms of the resulting sequence by m - 1, we create a sequence of numbers between 0 and 1, which is similar to sampling from a uniform distribution.
Of course, not all values of a, b, and m behave in this way; poorly chosen values are unsuitable for generating pseudorandom numbers.
For example, with a = 3, b = 2, m = 4, x_{0} = 1, we have:
x_{1} = (3 × 1 + 2) mod 4 = 1
x_{2} = (3 × 1 + 2) mod 4 = 1
and so on: the sequence is stuck at 1, which is clearly not random-looking.
Ideally, we want m to be a very large prime number, so that the period of the generator equals m - 1. In practice, a paper published in 1988 by Park and Miller found that a = 7^{5} = 16807, b = 0, and m = 2^{31} - 1 = 2147483647 (the maximum size of a signed integer in a 32-bit system) are good values for the Multiplicative Congruential Method.
Java's Random class is based on a generator with a = 25214903917, b = 11, and m = 2^{48} [1]. The class returns at most the 32 leading bits of each x_{i}, so it is possible to get the same value twice in a row (when x_{0} = 18698324575379, for instance) without the sequence repeating forever.
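The recurrence x_{k+1} = (a·x_{k} + b) mod m is easy to experiment with directly. Below is a minimal sketch in Python (rather than the Matlab used elsewhere in these notes) that reproduces the a = 13, b = 0, m = 31 example above:

```python
def lcg(a, b, m, seed, n):
    """Generate n values of the recurrence x_{k+1} = (a*x_k + b) mod m."""
    xs, x = [], seed
    for _ in range(n):
        x = (a * x + b) % m
        xs.append(x)
    return xs

# The a = 13, b = 0, m = 31 example: the generator has full period m - 1,
# visiting every integer in {1, ..., 30} exactly once before repeating.
seq = lcg(13, 0, 31, 1, 30)
print(sorted(seq) == list(range(1, 31)))  # True
```

Dividing each term by m - 1 (or m) rescales the sequence into [0, 1], as described above.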
General Methods
Since the Multiplicative Congruential Method can only be used for the uniform distribution, other methods must be developed in order to generate pseudorandom numbers from other distributions.
Inverse Transform Method
This method uses the fact that when a random sample from the uniform distribution is applied to the inverse of a cumulative distribution function (cdf) of some distribution, the result is a random sample from that distribution. This is shown by the following theorem:
Theorem:
If U ~ Unif(0, 1) and X = F^{-1}(U), where F is the continuous, monotonic cumulative distribution function (cdf) of some distribution, then X is distributed according to F (that is, F is the cdf of X).
Proof:
Recall that, if f is the pdf corresponding to F, where f is defined as 0 outside of its domain, then
F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt
So F is monotonically increasing, since the probability that X is less than a greater number must be at least the probability that X is less than a lesser number.
Note also that for the uniform distribution on [0, 1] we have, for all a within [0, 1],
P(U ≤ a) = a
So,
P(F^{-1}(U) ≤ x) = P(U ≤ F(x)) = F(x)
completing the proof.
Procedure (Continuous Case)
This method then gives us the following procedure for finding pseudo random numbers from a continuous distribution:
 Step 1: Draw U ~ Unif(0, 1).
 Step 2: Compute X = F^{-1}(U).
Example:
Suppose we want to draw a sample from f(x) = λe^{ − λx} where x > 0 (the exponential distribution).
We first find F(x) and then its inverse F^{-1}:
F(x) = ∫_{0}^{x} λe^{-λt} dt = 1 - e^{-λx}
F^{-1}(u) = -ln(1 - u)/λ
Now we can generate our random sample from F(x) by drawing u_{i} ~ Unif(0, 1) and computing
x_{i} = F^{-1}(u_{i}) = -ln(1 - u_{i})/λ
The x_{i} are now a random sample from f(x).
This example can be illustrated in Matlab using the code below: generate the u_{i}, calculate the x_{i} using the above formula with λ = 1, and plot the histogram of the x_{i} for i = 1, ..., 100,000.
u = rand(1,100000); x = -log(1-u)/1; hist(x)
The major problem with this approach is that we have to find F^{-1}, and for many distributions it is too difficult (or impossible) to find the inverse of F(x). Further, for some distributions it is not even possible to find a closed-form expression for F(x) itself; and even when a closed form exists, F^{-1} is usually difficult to find.
Procedure (Discrete Case)
The above method can be easily adapted to work on discrete distributions as well.
In general, in the discrete case we have possible values x_{0}, x_{1}, ... taken with probabilities P(X = x_{i}) = p_{i}, where Σ_{i} p_{i} = 1.
Thus we can define the following method to find pseudorandom numbers in the discrete case (note that the less-than signs from class have been changed to less-than-or-equal-to signs by me, since otherwise the case of U = 1 is missed):
 Step 1: Draw U ~ Unif(0, 1).
 Step 2:
 If U ≤ p_{0}, return X = x_{0}
 If p_{0} < U ≤ p_{0} + p_{1}, return X = x_{1}
 ...
 In general, if p_{0} + ... + p_{k-1} < U ≤ p_{0} + ... + p_{k}, return X = x_{k}
Example (from class):
Suppose we have the following discrete distribution:
P(X = 0) = 0.3, P(X = 1) = 0.2, P(X = 2) = 0.5
The cumulative distribution function (cdf) for this distribution is then:
F(0) = 0.3, F(1) = 0.5, F(2) = 1
Then we can generate numbers from this distribution as follows, given u_{i} drawn from Unif(0, 1): if u_{i} ≤ 0.3, return 0; if 0.3 < u_{i} ≤ 0.5, return 1; otherwise return 2.
This example can be illustrated in Matlab using the code below:
p=[0.3,0.2,0.5]; for i=1:1000; u=rand; if u <= p(1) x(i)=0; elseif u <= sum(p(1:2)) x(i)=1; else x(i)=2; end end
Acceptance-Rejection Sampling - May 14, 2009
Today, we continue the discussion on sampling (generating random numbers) from general distributions with the Acceptance/Rejection Method.
Acceptance/Rejection Method
Suppose we wish to sample from a target distribution f(x) that is difficult to sample from directly.
Let g(x) be a distribution that is easy to sample from and satisfies the condition:
f(x) ≤ c·g(x) for all x, where c ≥ 1 is a constant.
Since c·g(x) ≥ f(x) for all x, it is possible to obtain samples that follow f(x) by rejecting a proportion of the samples drawn from c·g(x).
This proportion depends on how different f(x) and g(x) are and may vary at different values of x.
That is, where c·g(x) is much larger than f(x), we will need to reject more of the samples drawn there than where c·g(x) is close to f(x).
Overall, it can be shown that by accepting each sample y drawn from g(x) with probability f(y)/(c·g(y)), we obtain samples that follow f(x).
Consider the example in the graph,
Sampling y = 7 from c·g(x) will yield a sample that follows the target distribution f(x), and y will be accepted with probability 1.
Sampling y = 9 from c·g(x) will yield a point that is distant from f(x) and will be accepted with a low probability.
Proof
Show that if points are sampled according to the Acceptance/Rejection method then they follow the target distribution.
P(y | accepted) = P(accepted | y)·g(y) / P(accepted), by Bayes' theorem,
and P(accepted | y) = f(y)/(c·g(y)), by hypothesis.
Then, the overall acceptance probability is
P(accepted) = ∫ P(accepted | y)·g(y) dy = ∫ [f(y)/(c·g(y))]·g(y) dy = 1/c.
Therefore,
P(y | accepted) = [f(y)/(c·g(y))]·g(y) / (1/c) = f(y),
as required.
Procedure (Continuous Case)
 Choose g(x) (a density function that is simple to sample from)
 Find a constant c such that c·g(x) ≥ f(x) for all x
 Let Y ~ g(x)
 Let U ~ Unif(0, 1)
 If U ≤ f(Y)/(c·g(Y)) then X = Y; else reject and go to step 1
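The steps above can be sketched generically in Python (the notes use Matlab; this is only an illustration, and the helper names `accept_reject`, `g_sample`, and `g_pdf` are my own). As a check, it is applied to the Beta(2,1) target f(x) = 2x with a Unif(0, 1) proposal and c = 2:

```python
import random

def accept_reject(f, g_sample, g_pdf, c, n):
    """Draw n samples from the target density f using proposal g,
    where c satisfies f(x) <= c * g(x) for all x."""
    samples = []
    while len(samples) < n:
        y = g_sample()                  # step: draw Y ~ g
        u = random.random()             # step: draw U ~ Unif(0, 1)
        if u <= f(y) / (c * g_pdf(y)):  # accept with probability f(Y)/(c g(Y))
            samples.append(y)
    return samples

# Check against Beta(2,1): f(x) = 2x on [0, 1], proposal Unif(0, 1), c = 2.
# The sample mean should be near E[X] = 2/3.
xs = accept_reject(lambda x: 2 * x, random.random, lambda x: 1.0, 2, 20000)
print(sum(xs) / len(xs))
```

With c = 2, on average 1/c = half of the proposed points are accepted, which matches the overall acceptance probability derived in the proof.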
Example:
Suppose we want to sample from Beta(2,1), for 0 ≤ x ≤ 1. Recall that its density is f(x) = 2x.
 Choose g(x) = 1 for 0 ≤ x ≤ 1 (the Unif(0, 1) density)
 Find c: c = 2 (see notes below)
 Let Y ~ Unif(0, 1)
 Let U ~ Unif(0, 1)
 If U ≤ f(Y)/(2·g(Y)) = Y, then X = Y; else go to step 1
c was chosen to be 2 in this example since the maximum value of f(x)/g(x) over the interval is 2. This condition is important since it helps us in finding a suitable c to apply the Acceptance/Rejection Method.
In MATLAB, the code that demonstrates the result of this example is:
i = 0; j = 1; while i < 1000 y = rand; u = rand; if u <= y x(j) = y; j = j + 1; i = i + 1; end end hist(x);
The histogram produced here should follow the target distribution, f(x) = 2x, on the interval 0 ≤ x ≤ 1.
The histogram shows the PDF of a Beta(2,1) distribution as expected.
The Discrete Case
The Acceptance/Rejection Method can be extended for discrete target distributions. The difference compared to the continuous case is that the proposal distribution g(x) must also be discrete distribution. For our purposes, we can consider g(x) to be a discrete uniform distribution on the set of values that X may take on in the target distribution.
Example
Suppose we want to sample from a distribution with the following probability mass function (pmf):
P(X=1) = 0.15 P(X=2) = 0.55 P(X=3) = 0.20 P(X=4) = 0.10
 Choose g(x) to be the discrete uniform distribution on the set {1,2,3,4}
 Find c: c = max_x f(x)/g(x) = 0.55/0.25 = 2.2
 Generate Y from the discrete uniform distribution on {1, 2, 3, 4}
 Let U ~ Unif(0, 1)
 If U ≤ f(Y)/(2.2·g(Y)), then X = Y; else go to step 1
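A small Python sketch of this discrete example (hedged: the notes use Matlab, and the variable names here are mine). With c = 2.2, on average 1/c ≈ 45% of proposals are accepted, and the empirical frequencies should approach the target pmf:

```python
import random

pmf = {1: 0.15, 2: 0.55, 3: 0.20, 4: 0.10}   # target f(x)
g = 0.25                                      # proposal: uniform on {1,2,3,4}
c = max(pmf.values()) / g                     # c = 2.2, so f(x) <= c*g(x)

def draw():
    while True:
        y = random.randint(1, 4)              # Y ~ discrete uniform
        u = random.random()                   # U ~ Unif(0, 1)
        if u <= pmf[y] / (c * g):             # accept w.p. f(Y)/(c g(Y))
            return y

samples = [draw() for _ in range(50000)]
freq = {k: samples.count(k) / len(samples) for k in pmf}
print(freq)   # each frequency should be close to the target pmf
```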
Limitations
If the proposal distribution is very different from the target distribution, we may have to reject a large number of points before a good sample size from the target distribution can be established. It may also be difficult to find a g(x) that satisfies all the conditions of the procedure.
We will now begin to discuss sampling from specific distributions.
Special Technique for sampling from Gamma Distribution
Recall that the cdf of the Gamma(t, λ) distribution is:
F(x) = ∫_{0}^{x} (λ^{t}/Γ(t)) y^{t-1} e^{-λy} dy
If we wish to sample from this distribution, it will be difficult for both the Inverse Method (difficulty in computing the inverse function) and the Acceptance/Rejection Method.
Additive Property of Gamma Distribution
Recall that if X_{1}, ..., X_{t} are independent exponential random variables, each with rate λ (in other words, X_{i} ~ Exp(λ)), then
X_{1} + X_{2} + ... + X_{t} ~ Gamma(t, λ)
It appears that if we want to sample from the Gamma distribution, we can consider sampling from t independent exponential distributions with rate λ (using the Inverse Transform Method) and adding them up. Details will be discussed in the next lecture.
Techniques for Normal and Gamma Sampling - May 19, 2009
We have examined two general techniques for sampling from distributions. However, for certain distributions more practical methods exist. We will now look at two cases,
Gamma distributions and Normal distributions, where such practical methods exist.
Gamma Distribution
Given the additive property of the gamma distribution:
If X_{1}, ..., X_{t} are independent random variables with X_{i} ~ Exp(λ), then
X_{1} + X_{2} + ... + X_{t} ~ Gamma(t, λ)
Using this property, we can use the Inverse Transform Method to generate samples from an exponential distribution with appropriate variables, and use these to generate a sample following a Gamma distribution.
 Procedure
 Sample independently from a uniform distribution t times, giving u_{1}, ..., u_{t}
 Sample independently from an exponential distribution t times, giving x_{1}, ..., x_{t} such that, using the Inverse Transform Method,
x_{i} = -ln(u_{i})/λ
 Using the additive property,
X = x_{1} + ... + x_{t} ~ Gamma(t, λ)
This procedure can be illustrated in Matlab using the code below with t = 5,λ = 1 :
U = rand(10000,5); X = sum(-log(U), 2); hist(X)
Normal Distribution
The cdf for the Standard Normal distribution is:
F(z) = ∫_{-∞}^{z} (1/√(2π)) e^{-t²/2} dt
We can see that the normal distribution is difficult to sample from using the general methods seen so far; e.g., the inverse of F(z) has no closed form. We may be able to use the Acceptance-Rejection method, but there are still better ways to sample from a Standard Normal distribution.
Box-Muller Method
The Box-Muller or polar method uses an approach where we have one space that is easy to sample in, and another with the desired distribution under a transformation. If we know such a transformation for the Standard Normal, then all we have to do is transform our easy sample and obtain a sample from the Standard Normal distribution.
 Properties of Polar and Cartesian Coordinates
 If x and y are points on the Cartesian plane, r is the length of the radius from a point in the polar plane to the pole, and θ is the angle formed with the polar axis, then
x = r·cos(θ), y = r·sin(θ), r² = x² + y²
Let X and Y be independent random variables with a standard normal distribution,
f(x) = (1/√(2π)) e^{-x²/2} and f(y) = (1/√(2π)) e^{-y²/2}
Also, let r and θ be the polar coordinates of (x, y). Then the joint distribution of the independent x and y is given by
f(x, y) = f(x)·f(y) = (1/(2π)) e^{-(x² + y²)/2}
It can also be shown that the joint distribution of r and θ is given by
f(r, θ) = (1/(2π)) r·e^{-r²/2}
Note that f(r, θ) consists of two density functions, Exponential and Uniform: writing d = r², we have d ~ Exp(1/2) and θ ~ Unif(0, 2π), with d and θ independent.
 Procedure for using Box-Muller Method
 Sample independently from a uniform distribution twice, giving u_{1}, u_{2} ~ Unif(0, 1)
 Generate polar coordinates using the exponential distribution of d and the uniform distribution of θ,
d = -2·ln(u_{1}) (so d ~ Exp(1/2)) and θ = 2π·u_{2} (so θ ~ Unif(0, 2π))
 Transform r = √d and θ back to x and y,
x = √d·cos(θ), y = √d·sin(θ)
Notice that the Box-Muller Method generates a pair of independent Standard Normal samples, x and y.
This procedure can be illustrated in Matlab using the code below:
u1 = rand(5000,1); u2 = rand(5000,1); d = -2*log(u1); theta = 2*pi*u2; x = d.^(1/2).*cos(theta); y = d.^(1/2).*sin(theta); figure(1); subplot(2,1,1); hist(x); title('X'); subplot(2,1,2); hist(y); title('Y');
Also, we can confirm that d and theta are indeed exponential and uniform random variables, respectively, in Matlab by:
subplot(2,1,1); hist(d); title('d follows an exponential distribution'); subplot(2,1,2); hist(theta); title('theta follows a uniform distribution over [0, 2*pi]');
Useful Properties (Single and Multivariate Normal)
The Box-Muller method as described above samples only from the standard normal distribution. However, both univariate and multivariate normal distributions have properties that allow us to use samples generated by the Box-Muller method to sample from any normal distribution in general.
 Properties of Normal distributions
 If Z ~ N(0, 1), then X = μ + σZ ~ N(μ, σ²)
 If Z = (Z_{1}, ..., Z_{d})^{T} with independent Z_{i} ~ N(0, 1), then X = μ + Σ^{1/2}Z ~ N(μ, Σ)
These properties can be illustrated through the following example in Matlab using the code below:
Example: For a Multivariate Normal distribution with μ = (2, 3)^{T} and Σ = [1, 1/2; 1/2, 1]:
u = [2; 3]; sigma = [ 1 1/2; 1/2 1]; r = randn(15000,2); ss = chol(sigma); X = ones(15000,1)*u' + r*ss; plot(X(:,1),X(:,2), '.');
Note: In the example above, we generated the square root of Σ using the Cholesky decomposition,
ss = chol(sigma);
which gives an upper-triangular matrix ss satisfying ss'·ss = Σ. Matlab also has the sqrtm function:
ss = sqrtm(sigma);
which gives a different (symmetric) matrix satisfying ss·ss = Σ, but the plot looks about the same (X has the same distribution).
Bayesian and Frequentist Schools of Thought - May 21, 2009
In this lecture we will continue to discuss sampling from specific distributions, introduce Monte Carlo Integration, and also talk about the differences between the Bayesian and Frequentist views on probability, along with references to Bayesian Inference.
Binomial Distribution
A Binomial random variable is the sum of n independent Bernoulli trials, each with probability of success p. For each trial we generate an independent uniform random variable u_{i} ~ Unif(0, 1); then X is the number of times that u_{i} ≤ p. If n is large enough, by the central limit theorem, the Normal distribution can be used to approximate a Binomial distribution.
Sampling from Binomial distribution in Matlab is done using the following code:
n=3; p=0.5; trials=1000; X=sum((rand(trials,n))'<=p); hist(X)
The histogram shows a Binomial distribution; for larger n, it would resemble a Normal distribution.
Monte Carlo Integration
Monte Carlo Integration is a numerical method of approximating the evaluation of integrals by simulation. In this course we will mainly look at three methods for approximating integrals:
 Basic Monte Carlo Integration
 Importance Sampling
 Markov Chain Monte Carlo (MCMC)
Bayesians vs. Frequentists
During the history of statistics, two major schools of thought emerged and have been locked in an ongoing struggle over which one has the correct view of probability. These are known as the Bayesian and Frequentist schools of thought. Bayesians and Frequentists each hold a different philosophical view on what defines probability. Below are some fundamental differences between the two schools:
Frequentist
 Probability is objective and refers to the limit of an event's relative frequency in a large number of trials. For example, a coin with a 50% probability of heads will turn up heads 50% of the time.
 Parameters are all fixed and unknown constants.
 Any statistical process only has interpretations based on limited frequencies. For example, a 95% C.I. of a given parameter will contain the true value of the parameter 95% of the time.
Bayesian
 Probability is subjective and can be applied to single events based on degree of confidence or belief. For example, a Bayesian can speak of tomorrow's weather as having a 50% chance of rain, whereas this would not make sense to a Frequentist, because tomorrow is just one unique event and cannot be assigned a relative frequency in a large number of trials.
 Parameters are random variables that have a given distribution, and other probability statements can be made about them.
 The parameter has a probability distribution, and point estimates are usually obtained by taking either the mode or the mean of that distribution.
Bayesian Inference
Example: If we have a screen that only displays single digits from 0 to 9, and this screen is split into a 4x5 matrix of pixels, then all together the 20 pixels that make up the screen can be referred to as X = (X_{1}, ..., X_{20}), which is our data, and the parameter of the data for this case, which we will refer to as θ, would be a discrete random variable that can take on the values 0 to 9. In this example, a Bayesian would be interested in finding f(θ | X), whereas a Frequentist would be more interested in finding the θ that maximizes the likelihood f(X | θ).
Bayes' Rule
f(θ | X) = f(X | θ)·f(θ) / f(X)
Note: In this case f(θ | X) is referred to as the posterior, f(X | θ) as the likelihood, f(θ) as the prior, and f(X) as the marginal, where θ is the parameter and X is the observed variable.
Procedure in Bayesian Inference
 First choose a probability distribution as the prior, which represents our beliefs about the parameters.
 Then choose a probability distribution for the likelihood, which represents our beliefs about the data.
 Lastly compute the posterior, which represents an update of our beliefs about the parameters after having observed the data.
As mentioned before, for a Bayesian, finding point estimates usually involves finding the mode or the mean of the parameter's distribution.
Methods
 Mode: the value of θ that maximizes f(θ | X)
 Mean: E[θ | X] = ∫ θ·f(θ | X) dθ
If θ is high-dimensional and we are only interested in one of its components, for example θ_{1} from θ = (θ_{1}, ..., θ_{k}), then we would have to calculate the integral
f(θ_{1} | X) = ∫∫...∫ f(θ | X) dθ_{2} ... dθ_{k}
This sort of calculation is usually very difficult or not feasible to compute, and thus we would need to do it by simulation.
Note:
 f(X) = ∫ f(X | θ)·f(θ) dθ is not a function of θ, and is called the Normalization Factor
 Therefore, since f(X) acts as a constant, the posterior is proportional to the likelihood times the prior:
f(θ | X) ∝ f(X | θ)·f(θ)
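The proportionality of posterior to likelihood times prior can be made concrete with a toy Python example (my own, not from the lecture): a hypothetical coin whose bias θ is one of three candidate values, updated after observing 7 heads in 10 tosses.

```python
thetas = [0.2, 0.5, 0.8]          # candidate parameter values
prior = [1/3, 1/3, 1/3]           # f(theta): uniform prior beliefs

def likelihood(theta, heads=7, tails=3):
    # f(X | theta) for 7 heads and 3 tails (the binomial coefficient is
    # omitted since it cancels in the normalization)
    return theta**heads * (1 - theta)**tails

unnorm = [likelihood(t) * p for t, p in zip(thetas, prior)]  # likelihood x prior
posterior = [u / sum(unnorm) for u in unnorm]                # divide by f(X)
print(posterior)   # sums to 1; the mode is at theta = 0.8
```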
Monte Carlo Integration - May 26, 2009
Today's lecture completes the discussion on the Frequentists and Bayesian schools of thought and introduces Basic Monte Carlo Integration.
Frequentist vs Bayesian Example  Estimating Parameters
Estimating the mean μ of a univariate Gaussian with known variance σ², given data x_{1}, ..., x_{N}:
Frequentist: μ is a fixed unknown constant, estimated by maximum likelihood.
Bayesian: μ is a random variable; we place a prior on it and compute its posterior given the data.
Frequentist Approach
Let X^{N} denote (x_{1}, x_{2}, ..., x_{N}). Using the Maximum Likelihood Estimation approach for estimating parameters, we maximize
L(μ) = Π_{i=1}^{N} (1/√(2πσ²)) e^{-(x_{i} - μ)²/(2σ²)}
Setting d(ln L)/dμ = 0, we get
μ̂ = (1/N) Σ_{i=1}^{N} x_{i}
Bayesian Approach
Assuming the prior on μ is Gaussian, μ ~ N(μ_{0}, σ_{0}²):
f(μ | X^{N}) ∝ f(X^{N} | μ)·f(μ)
By completing the square we conclude that the posterior is Gaussian as well:
μ | X^{N} ~ N(μ_{N}, σ_{N}²)
where
μ_{N} = (σ²·μ_{0} + σ_{0}²·Σ x_{i}) / (σ² + N·σ_{0}²) and σ_{N}² = σ²·σ_{0}² / (σ² + N·σ_{0}²)
The expectation from the posterior is different from the MLE. Note that as N → ∞, μ_{N} → μ̂ (the MLE). Also note that when N = 0 we get μ_{N} = μ_{0} (the prior mean).
Basic Monte Carlo Integration
Although it is almost impossible to find a precise definition of "Monte Carlo Method", the method is widely used and has numerous descriptions in articles and monographs. As an interesting fact, the term Monte Carlo is claimed to have been first used by Ulam and von Neumann as a Los Alamos code word for the stochastic simulations they applied to building better atomic bombs. Stochastic simulation refers to a random process in which its future evolution is described by probability distributions (counterpart to a deterministic process), and these simulation methods are known as Monte Carlo methods. [Stochastic process, Wikipedia]. The following example (external link) illustrates a Monte Carlo Calculation of Pi: [1]
We start with a simple example:
I = ∫_{a}^{b} h(x) dx = ∫_{a}^{b} w(x)·f(x) dx
where
 w(x) = h(x)·(b - a), and the p.d.f. f(x) = 1/(b - a) is that of Unif(a, b)
If x_{i} ~ Unif(a, b), then by the Law of Large Numbers
(1/N) Σ_{i=1}^{N} w(x_{i}) → E[w(X)] = I
Process
Given I = ∫_{a}^{b} w(x)·f(x) dx:
 Draw x_{1}, ..., x_{N} ~ f (here, Unif(a, b))
 Estimate Î = (1/N) Σ_{i=1}^{N} w(x_{i})
From this we can compute other statistics, such as
 Var(Î) = σ²/N, where σ² = (1/(N-1)) Σ_{i=1}^{N} (w(x_{i}) - Î)²
 CIs can be estimated as Î ± 1.96·√(Var(Î))
Example 1
Find E[√x] = ∫_{0}^{∞} √x·e^{-x} dx for f(x) = e^{-x}, x > 0 (the Exp(1) distribution).
 We need to draw from f(x) = e^{-x}, using the Inverse Transform Method: x_{i} = -ln(u_{i}).
This example can be illustrated in Matlab using the code below:
u = rand(100,1); x = -log(u); h = x.^0.5; mean(h) %The value obtained using the Monte Carlo method
F = @(x) sqrt(x).*exp(-x); quad(F,0,50) %The value of the integral computed numerically by Matlab
Example 2: Find ∫_{0}^{1} x³ dx (whose exact value is 1/4).
This example can be illustrated in Matlab using the code below:
x = rand(1000,1); mean(x.^3)
Example 3: To estimate an infinite integral such as I = ∫_{0}^{∞} h(x) dx:
 Substitute y = 1/(x + 1), so that dy = -dx/(x + 1)² and the integral becomes
I = ∫_{0}^{1} h(1/y - 1)/y² dy
which can then be estimated by basic Monte Carlo with y_{i} ~ Unif(0, 1).
Importance Sampling and Monte Carlo Simulation - May 28, 2009
During this lecture we covered two more examples of Monte Carlo simulation, finishing that topic, and began talking about Importance Sampling.
Binomial Probability Monte Carlo Simulations
Example 1:
You are given two independent Binomial distributions with success probabilities p_{1} and p_{2}. Using a Monte Carlo simulation, approximate the value of δ, where δ = p_{1} - p_{2}.
 X ~ Bin(n, p_{1}) with n = 100 trials and x = 80 observed successes; Y ~ Bin(m, p_{2}) with m = 100 trials and y = 60 observed successes.
So δ = p_{1} - p_{2}, where the prior on each p is a flat (uniform) distribution, and the expected value of δ is
E[δ | X, Y] = ∫∫ (p_{1} - p_{2})·f(p_{1}, p_{2} | X, Y) dp_{1} dp_{2}
Since X, Y are independent, we can split the conditional probability distribution:
f(p_{1}, p_{2} | X, Y) = f(p_{1} | X)·f(p_{2} | Y)
We need to find conditional distributions for p_{1} and p_{2} to draw samples from. With a flat prior, the posterior of the success probability of a Binomial with x successes in n trials is a Beta distribution. Let
 p_{1} | X ~ Beta(x + 1, n - x + 1) and p_{2} | Y ~ Beta(y + 1, m - y + 1)
Process:
 Draw samples for p_{1} and p_{2}: (p_{1}^{(1)}, p_{2}^{(1)}), ..., (p_{1}^{(N)}, p_{2}^{(N)});
 Compute δ^{(i)} = p_{1}^{(i)} - p_{2}^{(i)} in order to get N values for δ;
 δ̂ = (1/N) Σ_{i=1}^{N} δ^{(i)}.
Matlab Code:
 The Matlab code for recreating the above example is as follows:
n=100; %number of trials for X
m=100; %number of trials for Y
x=80; %number of successes for X trials
y=60; %number of successes for Y trials
p1=betarnd(x+1, n-x+1, 1, 1000);
p2=betarnd(y+1, m-y+1, 1, 1000);
delta=p1-p2;
mean(delta)
The mean in this example is given by 0.1938.
A 95% confidence interval for δ is represented by the interval between the 2.5% and 97.5% quantiles which covers 95% of the probability distribution. In Matlab, this can be calculated as follows:
q1=quantile(delta,0.025); q2=quantile(delta,0.975);
The interval is approximately
Note: In this case, we can also find E(δ) analytically, since E(δ) = E(p_{1}) - E(p_{2}) = 81/102 - 61/102 = 20/102 ≈ 0.196. Compare this with the maximum likelihood estimate for δ: 80/100 - 60/100 = 0.2.
Example 2:
Bayesian Inference for Dose Response
We conduct an experiment by giving rats one of ten possible doses of a drug, x_{1} < x_{2} < ... < x_{10}, where each subsequent dose is more lethal than the previous one.
For each dose x_{i} we test n rats and observe Y_{i}, the number of rats that die. Therefore,
Y_{i} ~ Bin(n, p_{i}), i = 1, ..., 10.
We can assume that the probability of death grows with the concentration of drug given, i.e. p_{1} ≤ p_{2} ≤ ... ≤ p_{10}. Estimate the dose at which the animals have at least a 50% chance of dying.
 Let δ = min{ i : p_{i} ≥ 0.5 }, where the p_{i} satisfy p_{1} ≤ ... ≤ p_{10}
 We are interested in δ, since any higher concentration is known to have a higher death rate.
Solving this analytically is difficult:
 δ = g(p_{1}, ..., p_{10}), where g is an unknown (non-closed-form) function
 E[δ | Y] = ∫...∫ g(p_{1}, ..., p_{10})·f(p_{1}, ..., p_{10} | Y_{1}, ..., Y_{10}) dp_{1} ... dp_{10}, where
 f(p_{1}, ..., p_{10} | Y_{1}, ..., Y_{10}) ∝ Π_{i=1}^{10} p_{i}^{Y_{i}}·(1 - p_{i})^{n - Y_{i}} on the constrained region p_{1} ≤ ... ≤ p_{10}
Process: Monte Carlo
We assume that, ignoring the ordering constraint, each p_{i} | Y_{i} ~ Beta(Y_{i} + 1, n - Y_{i} + 1), and we enforce the constraint by rejection:
 Draw p_{i} ~ Beta(Y_{i} + 1, n - Y_{i} + 1) for i = 1, ..., 10
 Keep the sample only if it satisfies p_{1} ≤ p_{2} ≤ ... ≤ p_{10}, otherwise discard it and try again.
 Compute δ = min{ i : p_{i} ≥ 0.5 } by finding the first dose with at least a 50% death probability.
 Repeat the process N times to get N estimates δ_{1}, ..., δ_{N}.
 δ̂ = (1/N) Σ_{i=1}^{N} δ_{i}.
For instance, for each dose level x_{i}, for 1 ≤ i ≤ 10, 10 rats are used, and the observed number of deaths is Y_{i}, where Y_{1} = 4, Y_{2} = 3, etc.
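The rejection-based process above can be sketched in Python (the course uses Matlab). The death counts below are hypothetical, since the notes only give Y_{1} = 4, Y_{2} = 3, ...; each p_{i} is drawn from its Beta(Y_{i} + 1, n - Y_{i} + 1) posterior as in the previous example:

```python
import random

n = 10                                   # rats tested at each dose level
y = [1, 2, 3, 3, 4, 5, 6, 7, 8, 9]       # hypothetical death counts, doses 1..10

def draw_monotone_p():
    # Rejection step: keep a draw only if p_1 <= p_2 <= ... <= p_10.
    while True:
        ps = [random.betavariate(yi + 1, n - yi + 1) for yi in y]
        if all(a <= b for a, b in zip(ps, ps[1:])):
            return ps

def first_lethal_dose(ps):
    # delta: index of the first dose with at least a 50% death probability
    # (None if no tested dose reaches 50% in this draw)
    return next((i + 1 for i, p in enumerate(ps) if p >= 0.5), None)

draws = [first_lethal_dose(draw_monotone_p()) for _ in range(100)]
estimates = [d for d in draws if d is not None]
print(sum(estimates) / len(estimates))   # Monte Carlo estimate of delta
```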
Importance Sampling
In statistics, Importance Sampling helps estimate properties of a particular distribution. As in the Acceptance/Rejection method, we choose a convenient distribution from which to simulate the random variables. The main difference in importance sampling, however, is that certain values of the input random variables have more impact on the parameter being estimated than others. As shown in the figure, the uniform distribution is a proposal distribution to sample from and g(x) is the target distribution.
Here we cast the integral I = ∫_{0}^{1} g(x) dx as the expectation of g(x) with respect to U ~ Unif(0, 1), such that
I = ∫_{0}^{1} g(x)·1 dx = E[g(U)].
Hence we can approximate this integral by
Î = (1/N) Σ_{i=1}^{N} g(u_{i}), with u_{i} ~ Unif(0, 1).
[Source: Monte Carlo Methods and Importance Sampling, Eric C. Anderson (1999). Retrieved June 9th from URL: http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf]
In I = ∫ h(x)·f(x) dx, Monte Carlo simulation can be used only if it is easy to sample from f(x). Otherwise, another method must be applied. If sampling from f(x) is difficult but there exists a probability distribution function g(x) which is easy to sample from, then I can be written as
I = ∫ h(x)·(f(x)/g(x))·g(x) dx,
the expectation of h(x)·f(x)/g(x) with respect to g(x), and therefore
Î = (1/N) Σ_{i=1}^{N} h(x_{i})·f(x_{i})/g(x_{i}), with x_{i} drawn from g(x).
Process
 Choose g(x) such that it's easy to sample from, and draw x_{1}, ..., x_{N} ~ g(x)
 Compute Î = (1/N) Σ_{i=1}^{N} h(x_{i})·f(x_{i})/g(x_{i})
Note: By the law of large numbers, Î converges in probability to I.
"Weighted" average
 The term "importance sampling" is used to describe this method because a higher 'importance' or 'weighting' is given to the values sampled from g(x) that are closer to the original distribution f(x), which we would ideally like to sample from (but cannot because it is too difficult).
 One can view w(x) = f(x)/g(x) as a weight. Then
Î = (1/N) Σ_{i=1}^{N} w(x_{i})·h(x_{i}),
i.e. we are computing a weighted sum of the h(x_{i}) instead of an unweighted sum. This is like applying a regular Monte Carlo simulation to h(x) and weighting each result by how well its sample point represents f (i.e. giving higher weight where f(x_{i})/g(x_{i}) is high).
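As a concrete Python check of the weighted-sum estimator (a sketch of my own, not from the lecture): estimate I = ∫_{0}^{1} x² dx = 1/3 with f = Unif(0, 1) and h(x) = x², but drawing the x_{i} from the proposal g(x) = 2x (a Beta(2,1) density) instead:

```python
import random

N = 100000
total = 0.0
for _ in range(N):
    x = random.betavariate(2, 1)   # x_i ~ g, with g(x) = 2x on (0, 1]
    w = 1.0 / (2 * x)              # weight w(x_i) = f(x_i) / g(x_i) = 1/(2x)
    total += (x ** 2) * w          # accumulate h(x_i) * w(x_i)

I_hat = total / N
print(I_hat)   # close to 1/3
```

Note that here h(x)·w(x) = x/2 stays bounded, so the weights cannot blow up; the next lecture discusses what goes wrong when f/g can become large.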
A Deeper Look into Importance Sampling - June 2, 2009
From last class, we have determined that an integral I = ∫ h(x)·f(x) dx can be written in the form Î = (1/N) Σ_{i=1}^{N} h(x_{i})·f(x_{i})/g(x_{i}), with x_{i} drawn from g(x). We continue our discussion of Importance Sampling here.
Importance Sampling
We can see that the integral is just the expectation of h(x)·w(x) with respect to g(x), where w(x) = f(x)/g(x) is a weight. In the case where f(x) > g(x), a greater weight is assigned to x. Thus, the points with more weight are deemed more important, hence "importance sampling". This can be seen as a variance reduction technique.
Problem
The method of Importance Sampling is simple but can lead to some problems. The Î estimated by Importance Sampling could have infinite standard error.
Given Î = (1/N) Σ w(x_{i})·h(x_{i}), where w(x) = f(x)/g(x).
Obtaining the second moment,
E_{g}[(w(X)·h(X))²] = ∫ h²(x)·f²(x)/g(x) dx.
We can see that if g(x) → 0 where h(x)·f(x) ≠ 0, then this integral → ∞, and so Var(Î) could be infinitely large. This occurs if g has a thinner tail than f, so that f²/g can become infinitely large. The general idea here is that f/g should not be large.
Remark 1
It is evident that g(x) should be chosen such that it has a thicker tail than f(x). If f is large over a set A but g is small there, then w = f/g would be large over A and it would result in a large variance.
Remark 2
It is useful if we can choose g to be similar to f·|h| in terms of shape. Ideally, the optimal g should be proportional to |h(x)|·f(x) and have a thicker tail. It's important to take the absolute value of h(x), since g must be a non-negative density. Analytically, we can show that the best g is the one that results in a variance that is minimized.
Remark 3
Choose g(x) such that it is similar to |h(x)|·f(x) in terms of shape. That is, we want
g(x) ∝ |h(x)|·f(x)
Theorem (Minimum Variance Choice of g)
The choice of g that minimizes the variance of Î is
g*(x) = |h(x)|·f(x) / ∫ |h(s)|·f(s) ds.
Proof:
We know that
Î = (1/N) Σ_{i=1}^{N} h(x_{i})·f(x_{i})/g(x_{i})
The variance of Î is
Var(Î) = (1/N)·[ E_{g}((h·f/g)²) - (E_{g}(h·f/g))² ] = (1/N)·[ ∫ h²(x)·f²(x)/g(x) dx - I² ]
As we can see, the second term does not depend on g. Therefore, to minimize Var(Î) we only need to minimize the first term. In doing so we will use Jensen's Inequality.
If φ is a convex function (twice differentiable with φ''(x) ≥ 0), then
φ(E[X]) ≤ E[φ(X)]
Essentially, the definition of convexity implies that the line segment between two points on the curve lies above the curve,
φ(αx_{1} + (1 - α)x_{2}) ≤ α·φ(x_{1}) + (1 - α)·φ(x_{2}), where 0 ≤ α ≤ 1,
which can then be generalized to expectations.
Proof (cont)
Using Jensen's Inequality with φ(x) = x²,
E_{g}[(h(X)·f(X)/g(X))²] ≥ (E_{g}[|h(X)|·f(X)/g(X)])²
Therefore
E_{g}[|h(X)|·f(X)/g(X)] = ∫ (|h(x)|·f(x)/g(x))·g(x) dx = ∫ |h(x)|·f(x) dx
and
(∫ |h(x)|·f(x) dx)² is a lower bound on E_{g}[(h·f/g)²],
since f and g are density functions and are non-negative.
If we substitute g*(x) = |h(x)|·f(x) / ∫ |h(s)|·f(s) ds into ∫ h²(x)·f²(x)/g(x) dx, we can see that this lower bound is attained, as required. Details omitted.
However, this is mostly of theoretical interest. In practice, it is impossible or very difficult to compute g*, since it involves the integral ∫ |h(x)|·f(x) dx, which is essentially the quantity we are trying to estimate.
Note: Jensen's inequality is actually unnecessary here. We just use it to get E[X²] ≥ (E[X])², which could be derived from variance properties: Var(X) = E[X²] - (E[X])² ≥ 0.
Importance Sampling and Markov Chain Monte Carlo (MCMC) - June 4, 2009
Remark 4:
I = ∫ h(x)·f(x) dx / ∫ f(x) dx, where f need not be normalized.
Apply the idea of importance sampling to both the numerator and the denominator, drawing x_{1}, ..., x_{N} ~ g:
numerator ≈ (1/N) Σ h(x_{i})·w(x_{i}) and denominator ≈ (1/N) Σ w(x_{i}), where w(x_{i}) = f(x_{i})/g(x_{i})
The above results in the following form of Importance Sampling:
Î = Σ h(x_{i})·w(x_{i}) / Σ w(x_{i}), where w(x_{i}) = f(x_{i})/g(x_{i})
Note that any unknown proportionality constant in f cancels between the numerator and the denominator.
This is very important and useful especially when f is known only up to a proportionality constant. Often, this is the case in the Bayesian approach when f is a posterior density function.
Example of Importance Sampling
Estimate I = P(X ≥ 3) for X ~ N(0, 1),
 where f(x) = (1/√(2π))·e^{-x²/2}.
Approach I: Monte Carlo
 I = E_{f}[h(X)] ≈ (1/N) Σ h(x_{i}), where h(x) = 1 if x ≥ 3 and 0 otherwise, and x_{i} ~ N(0, 1)
The idea here is to sample from the normal distribution and count the number of observations that are greater than or equal to 3.
The variability will be high in this case when using Monte Carlo, since this is a low-probability (tail) event, and different runs may give significantly different values. For example, the first run may give only 3 occurrences (i.e., if we generate 1000 samples, the estimated probability will be .003), the second run may give 5 occurrences (probability .005), etc.
This example can be illustrated in Matlab using the code below (we will be generating 100 samples in this case):
format long for i = 1:100 a(i) = sum(randn(100,1)>=3)/100; end meanMC = mean(a) varMC = var(a)
On running this, we get meanMC = 0.0005 and
Approach II: Importance Sampling
 Î = (1/N) Σ h(x_{i})·f(x_{i})/g(x_{i}), where f(x) is the standard normal density and g(x) needs to be chosen wisely so that it concentrates on the region of interest.
 Let g(x) be the N(4, 1) density, centered in the tail region. The weight is then w(x) = f(x)/g(x) = e^{(-x² + (x - 4)²)/2} = e^{8 - 4x}.
This example can be illustrated in Matlab using the code below:
for j = 1:100 N = 100; x = randn(N,1) + 4; for ii = 1:N h = x(ii)>=3; b = exp(8-4*x(ii)); w(ii) = h*b; end I(j) = sum(w)/N; end MEAN = mean(I) VAR = var(I)
Running the above code gives a MEAN close to the true value P(X ≥ 3) ≈ 0.0013, and a VAR that is very close to 0 and much less than the variability observed when using basic Monte Carlo.
Markov Chain Monte Carlo (MCMC)
Before we tackle Markov chain Monte Carlo methods, which essentially are a 'class of algorithms for sampling from probability distributions based on constructing a Markov chain' [MCMC, Wikipedia], we will first give a formal definition of Markov Chain.
Consider the same integral: I = ∫ h(x)·f(x) dx.
Idea: If X_{0}, X_{1}, X_{2}, ... is a Markov Chain with stationary distribution f(x), then under some conditions
(1/N) Σ_{i=1}^{N} h(X_{i}) → I.
Stochastic Process:
A Stochastic Process is a collection of random variables { X_{t} : t ∈ T }.
 State Space: the set of values that the random variables X_{t} take.
 Index Set: the set T that t takes values from, which in general could be discrete or continuous, but we are only interested in the discrete case in this course.
Example 1
X_{1}, X_{2}, X_{3}, ... i.i.d. random variables
 State Space: the common set of values the X_{i} can take
 Index Set: {1, 2, 3, ...}
Example 2
 X_{t}: the price of a stock
 t: the opening date of the market
Basic Fact: In general, if we have random variables X_{1}, ..., X_{n}, their joint distribution factors as
f(x_{1}, ..., x_{n}) = f(x_{1})·f(x_{2} | x_{1})·f(x_{3} | x_{1}, x_{2}) ··· f(x_{n} | x_{1}, ..., x_{n-1})
Markov Chain:
A Markov Chain is a special form of stochastic process in which X_{t} depends only on X_{t-1}.
For example,
f(x_{1}, ..., x_{n}) = f(x_{1})·f(x_{2} | x_{1})·f(x_{3} | x_{2}) ··· f(x_{n} | x_{n-1})
Transition Probability:
The probability of going from one state to another in one step: P_{ij} = Pr(X_{t+1} = j | X_{t} = i).
Transition Matrix:
For n states, the transition matrix P is an n × n matrix with entries P_{ij} = Pr(X_{t+1} = j | X_{t} = i); each row of P is a pmf and sums to 1.
Example:
A "Random Walk" is an example of a Markov Chain. Let's suppose that the direction of our next step is decided in a probabilistic way. The probability of moving to the right is . And the probability of moving to the left is . Once the first or the last state is reached, then we stop. The transition matrix that express this process is shown as below:
Markov Chain Definitions - June 9, 2009
Practical application for estimation:
The general concept of this application lies in having a set of generated x_{i} which approach a distribution f(x), so that a variation of importance estimation can be used to estimate an integral of the form
I = ∫ h(x)·f(x) dx
by
Î = (1/N) Σ_{i=1}^{N} h(x_{i})
All that is required is a Markov chain which eventually converges to f(x).
In the previous example, the entries p_{ij} in the transition matrix P represent the probability of reaching state j from state i after one step. For this reason, the entries in a particular row sum to 1, as each row must itself be a pmf if a transition from i is to lead to a state still within the state set of X_{t}.
Homogeneous Markov Chain
The probability matrix P is the same for all indices n: Pr(X_{n+1} = j | X_{n} = i) does not depend on n.
If we denote the pmf of X_{n} by the probability vector
μ_{n} = (μ_{n}(1), μ_{n}(2), ..., μ_{n}(N)),
where i denotes an ordered index of all possible states of X.
Then we have a definition for the
marginal probability
μ_{n}(i) = Pr(X_{n} = i)
where we simplify X_{n} to represent the ordered index of a state rather than the state itself.
From this definition it can be shown that,
Proof:
and since
Therefore,
With this, it is possible to define P(n) as an n-step transition matrix where
Theorem:
Proof: From the previous conclusion
And since this is a homogeneous Markov chain, P does not depend on i so
From this it becomes easy to define the n-step transition matrix as
Summary of definitions
 transition matrix: an N × N matrix, where N = |X|, with P_{ij} = Pr(X_{1} = j | X_{0} = i), where each row sums to 1
 n-step transition matrix: also N × N, with P_{ij}(n) = Pr(X_{n} = j | X_{0} = i)
 marginal probability of X: μ_{n}(i) = Pr(X_{n} = i)
 Theorem: P_{n} = P^{n}
 Theorem: μ_{n} = μ_{0}P^{n}
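The two theorems in the summary can be verified numerically. A small Python check (the 2-state chain is an arbitrary illustrative example, not one from the lecture):

```python
# Verify that mu_n obtained step by step (mu_{n+1} = mu_n P) agrees with
# mu_n = mu_0 P^n computed via the n-step transition matrix P^n.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def vec_mat(v, A):
    return [sum(v[i] * A[i][j] for i in range(len(v))) for j in range(len(A[0]))]

P = [[0.9, 0.1],
     [0.5, 0.5]]
mu0 = [1.0, 0.0]  # start in state 0 with probability 1

# mu_3 obtained step by step
mu = mu0
for _ in range(3):
    mu = vec_mat(mu, P)

# mu_3 obtained from the 3-step matrix P^3
P3 = mat_mul(mat_mul(P, P), P)
mu_direct = vec_mat(mu0, P3)

assert all(abs(a - b) < 1e-12 for a, b in zip(mu, mu_direct))
```

The two routes give the same marginal, which is exactly the statement μ_n = μ_0 P^n.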

Definitions of different types of state sets
Define State Space
If P_{ij}(n) > 0 for some n, then we say i reaches j, denoted by i → j.
This also means j is accessible from i.
If i → j and j → i, then we say i and j communicate, denoted by i ↔ j.
Theorems
1) i ↔ i (reflexivity)
2) i ↔ j implies j ↔ i (symmetry)
3) If i ↔ j and j ↔ k, then i ↔ k (transitivity)
4) The set of states of X can be written as a unique disjoint union of subsets (equivalence classes), where two states i and j communicate if and only if they belong to the same subset
More Definitions
A set is Irreducible if all states communicate with each other (has only one equivalence class).
A subset of states is Closed if once you enter it, you can never leave.
A subset of states is Open if once you leave it, you can never return.
An Absorbing Set is a closed set with only 1 element (i.e. consists of a single state).
Note
 We cannot have i ↔ j with i recurrent and j transient, since recurrence is a class property (see below).
 All states in an open class are transient.
 A Markov Chain with a finite number of states must have at least 1 recurrent state.
 A closed class with an infinite number of states has all transient or all recurrent states.
Again on Markov Chain  June 11, 2009
Decomposition of Markov chain
In the previous lecture it was shown that a Markov Chain can be written as the disjoint union of its classes. This decomposition is always possible and it is reduced to one class only in the case of an irreducible chain.
Example:
Let and the transition matrix be:
The decomposition in classes is:
 class 1: From the matrix we see that states 1 and 2 have nonzero transition probabilities only between each other
 class 2: State 3 can go to every other state, but none of the others can go to it
 class 3: This is an absorbing state, since it is a closed class with only one element
Recurrent and Transient states
A state i is called recurrent (or persistent) if
Pr(X_{n} = i for some n ≥ 1 | X_{0} = i) = 1
That means that the probability of coming back to state i, starting from state i, is 1.
If that is not the case (i.e. the probability is less than 1), then state i is transient.
It is straightforward to prove that a finite irreducible chain is recurrent.
Theorem
Given a Markov chain,
A state i is recurrent if and only if Σ_{n} P_{ii}(n) = ∞
A state i is transient if and only if Σ_{n} P_{ii}(n) < ∞
Properties
 If i is recurrent and i ↔ j, then j is recurrent
 If i is transient and i ↔ j, then j is transient
 In an equivalence class, either all states are recurrent or all states are transient
 A finite Markov chain must have at least one recurrent state
 The states of a finite, irreducible Markov chain are all recurrent (proved using the previous proposition and the fact that there is only one class in this kind of chain)
In the example above, states one and two form a closed set, so they are both recurrent states. State four is an absorbing state, so it is also recurrent. State three is transient.
Example
Let X = {..., -2, -1, 0, 1, 2, ...} and suppose that P_{i,i+1} = p, P_{i,i-1} = 1 - p = q, and P_{ij} = 0 otherwise.
This is the Random Walk that we have already seen in a previous lecture, except that it extends infinitely in both directions.
We now see other properties of this particular Markov chain:
 Since all states communicate, if one of them is recurrent, then all states will be recurrent. On the other hand, if one of them is transient, then all the others will be transient too.
 Consider now the case in which we start in state i. If we move n steps to the right or to the left, the only way to come back to i is to take n steps in the opposite direction.
We now want to know whether this state is transient or recurrent or, equivalently, whether Σ_{n} P_{ii}(n) = ∞ or not.
To proceed with the analysis, we use Stirling's approximation: n! ≈ √(2πn) (n/e)^{n}.
The return probability can therefore be approximated by:
P_{ii}(2n) = C(2n, n) p^{n} q^{n} ≈ (4pq)^{n} / √(πn)
And the formula becomes:
Σ_{n} P_{ii}(2n) ≈ Σ_{n} (4pq)^{n} / √(πn)
We can conclude that if p ≠ 1/2, then 4pq < 1, the series converges, and the state is transient; if p = 1/2, then 4pq = 1, the series diverges, and the state is recurrent.
An alternative to Stirling's approximation is to use the generalized binomial theorem to get the following formula:
Then substitute in x = pq.
So we reach the same conclusion: all states are recurrent iff p = q = 1/2.
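The divergence criterion can also be checked numerically. A Python sketch (the values of p and the cutoffs are illustrative; the term recurrence avoids computing the huge binomial coefficients directly):

```python
# Partial sums of the return probabilities P_00(2n) = C(2n, n) (pq)^n
# for the infinite random walk: the sum stays bounded when p != 1/2
# (transient) and keeps growing when p = 1/2 (recurrent).
def partial_sum(p, N):
    q = 1 - p
    s, term = 0.0, 1.0  # term = C(2n, n) (pq)^n, starting at n = 0
    for n in range(N):
        # C(2(n+1), n+1) = C(2n, n) * (2n+1)(2n+2) / (n+1)^2
        term *= (2 * n + 1) * (2 * n + 2) / ((n + 1) ** 2) * (p * q)
        s += term
    return s

# p = 0.3: the series converges (its limit is 1/sqrt(1 - 4pq) - 1 = 1.5)
s_unfair = partial_sum(0.3, 200)
# p = 0.5: the partial sums grow roughly like sqrt(N)
s_fair_200 = partial_sum(0.5, 200)
s_fair_800 = partial_sum(0.5, 800)

assert s_unfair < 1.6               # bounded: transient
assert s_fair_800 > s_fair_200 + 5  # still growing: recurrent
```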
Convergence of Markov chain
We define the hitting (first passage) time T_{ij} = min{n ≥ 1 : X_{n} = j | X_{0} = i} as the minimum time to go from state i to state j. It is also possible that state j is not reachable from state i, in which case T_{ij} = ∞.
The mean recurrence time of state i is defined as:
m_{i} = E[T_{ii}]
where T_{ii} is the time of first return to i.
Using the objects we just introduced, we say that a recurrent state i is positive recurrent if m_{i} < ∞ and null recurrent if m_{i} = ∞.
Lemma
In a finite state Markov Chain, all recurrent states are positive recurrent.
Periodic and aperiodic Markov chain
A Markov chain is said to have period d if, starting from a state, we can only return to that state in a number of steps that is a multiple of d.
Example
Consider the three-state chain:
It is evident that, starting from state 1, we will return to it on every 3rd step, and the same holds for the other two states. The chain is therefore periodic with period d = 3.
An irreducible Markov chain is called aperiodic if its period is d = 1.
Another Example
Consider the chain
This chain is periodic by definition: you can only get back to state 1 after an even number of steps, so the period is d = 2.
Markov Chains and their Stationary Distributions  June 16, 2009
New Definition:Ergodic
A state is Ergodic if it is non-null recurrent and aperiodic. A Markov Chain is ergodic if all its states are ergodic.
Define a vector π where π_{i} ≥ 0 and
Σ_{i} π_{i} = 1
(i.e. π is a pmf)
π is a stationary distribution if π = πP, where P is a transition matrix.
Limiting Distribution
If μ_{n} → π as n → ∞,
then π is the limiting distribution of the Markov Chain represented by P.
Theorem: An irreducible, ergodic Markov Chain has a unique stationary distribution π and there exists a limiting distribution which is also π.
Detailed Balance
The condition for detailed balance is π_{i} P_{ij} = π_{j} P_{ji} for all states i and j.
Theorem
If π satisfies detailed balance then it is a stationary distribution.
Proof.
We need to show π = πP, i.e. (πP)_{j} = π_{j} for every j:
(πP)_{j} = Σ_{i} π_{i} P_{ij} = Σ_{i} π_{j} P_{ji} = π_{j} Σ_{i} P_{ji} = π_{j}
as required, using detailed balance in the second step and Σ_{i} P_{ji} = 1 in the last.
Warning! A chain that has a stationary distribution does not necessarily converge.
For example, has a stationary distribution but it will not converge.
Stationary Distribution
π is a stationary (or invariant) distribution if π = πP. For example, if π = [0.5 0 0.5], the chain spends half of its time in the first state and half of its time in the third state.
Theorem
An irreducible ergodic Markov Chain has a unique stationary distribution π. The limiting distribution exists and is equal to π.
If g is any bounded function, then with probability 1:
(1/N) Σ_{n=1}^{N} g(X_{n}) → Σ_{i} g(i) π_{i} = E_{π}[g]
Example
Find the limiting distribution of
Solve π = πP.
Also Σ_{i} π_{i} = 1.
We can solve the above system of equations and obtain
Thus, and we get
Subbing back into the system of equations we obtain
and
Therefore the limiting distribution is
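The system π = πP can also be solved numerically by power iteration, since μ_n = μ_0 P^n converges to π for an irreducible ergodic chain. A Python sketch (the 3-state matrix is an illustrative example, not the one from the lecture):

```python
# Find the stationary/limiting distribution of a small ergodic chain by
# repeatedly applying P to a starting pmf.
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]

pi = [1.0, 0.0, 0.0]  # any starting pmf
for _ in range(200):
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

# Check stationarity: pi P = pi, and pi is a pmf
piP = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]
assert all(abs(a - b) < 1e-10 for a, b in zip(pi, piP))
assert abs(sum(pi) - 1.0) < 1e-10
assert abs(pi[1] - 0.5) < 1e-6  # by detailed balance, pi = (1/4, 1/2, 1/4)
```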
Monte Carlo using Markov Chain  June 18, 2009
Consider the problem of computing E_{f}[g(X)] = ∫ g(x) f(x) dx.
Generate X_{1}, X_{2}, ... from a Markov Chain with stationary distribution f(x).
Metropolis Hastings Algorithm
The Metropolis Hastings Algorithm first originated in the physics community in 1953 and was adopted later on by statisticians. It was originally used for the computation of a Boltzmann distribution, which describes the distribution of energy for particles in a system. In 1970, Hastings extended the algorithm to the general procedure described below.
Suppose we wish to sample from f(x), the 'target' distribution. Choose q(y  x), the 'proposal' distribution that is easily sampled from.
1. Initialize X_{0}; this is the starting point of the chain, chosen randomly. Set index i = 0.
2. Draw Y ~ q(y | X_{i}).
3. Compute r(X_{i}, Y) = min( [f(Y) q(X_{i} | Y)] / [f(X_{i}) q(Y | X_{i})] , 1 ).
4. Draw U ~ Uniform(0, 1).
5. If U < r then X_{i+1} = Y
 else X_{i+1} = X_{i}
6. Update the index, i = i + 1, and go to step 2.
A couple of remarks about the algorithm
Remark 1: A good choice for q(y | x) is N(x, b^{2}), where b is a constant and x = X_{i} is the current state, i.e. the proposal distribution is a normal centered at the current, randomly chosen, state.
Remark 2: If the proposal distribution is symmetric, q(y | x) = q(x | y), then r = min( f(Y)/f(X_{i}) , 1 ). This special case is called the Metropolis Algorithm, the original algorithm of Metropolis (1953).
Remark 3: N(x, b^{2}) is symmetric: the probability of setting the mean to x and sampling y equals the probability of setting the mean to y and sampling x.
Example: The Cauchy distribution has density
f(x) = 1 / [π (1 + x^{2})]
Let the proposal distribution be q(y | x) = N(x, b^{2})
r = min( f(Y)/f(X_{i}) , 1 ) = min( (1 + X_{i}^{2}) / (1 + Y^{2}) , 1 ), since q is symmetric
Now, having calculated , we complete the problem in Matlab using the following code:
b=2; % let b=2 for now, we will see what happens when b is smaller or larger
X(1)=randn;
for i=2:10000
    Y=b*randn+X(i-1); % we want to decide whether we accept this Y
    r=min( (1+X(i-1)^2)/(1+Y^2), 1);
    u=rand;
    if u<r
        X(i)=Y; % accept Y
    else
        X(i)=X(i-1); % reject Y, remaining in the current state
    end;
end;
We need to be careful about choosing b!
 If b is too large:
 Proposed points Y tend to fall where f(Y) is very small, so the ratio f(Y)/f(X_{i}), and hence r, is very small as well.
 It is highly unlikely that U < r; the probability of rejecting is high, so the chain is likely to get stuck in the same state for a long time and may not converge to the right distribution.
 This is easy to observe by looking at the histogram of X: its shape will not resemble the shape of the target f.
 Most likely we reject Y and the chain gets stuck.
 If b is too small
 Then we are setting up our proposal distribution to be much narrower than the target f, so the chain will not have a chance to explore the state space and visit the majority of the states of the target f.
 If b is just right
 A well-chosen b will help avoid the issues mentioned above, and we can say that the chain is "mixing well".
Mathematical explanation for why this algorithm works:
So far we have talked about Markov chains.
We have seen that:
 π satisfies detailed balance if π_{i} P_{ij} = π_{j} P_{ji}, and
 if π satisfies detailed balance then it is a stationary distribution
In the continuous case we write detailed balance as f(x) P(x → y) = f(y) P(y → x), and say that
f is stationary if ∫ f(x) P(x → y) dx = f(y).
We want to show that if detailed balance holds (i.e. assume f(x) P(x → y) = f(y) P(y → x)), then f is a stationary distribution.
That is, we want to show: ∫ f(x) P(x → y) dx = f(y).
 ∫ f(x) P(x → y) dx = ∫ f(y) P(y → x) dx = f(y) ∫ P(y → x) dx = f(y)
 and the result follows, since ∫ P(y → x) dx = 1.
Now, we need to show that detailed balance holds in Metropolis-Hastings, i.e. f(x) P(x → y) = f(y) P(y → x).
Consider 2 points x and y:
 Either f(x) q(y | x) > f(y) q(x | y) OR f(x) q(y | x) < f(y) q(x | y) (ignoring the case in which they are equal).
Without loss of generality, suppose that f(x) q(y | x) > f(y) q(x | y).
In this case r(x, y) = f(y) q(x | y) / [f(x) q(y | x)] and r(y, x) = 1.
 Some intuitive meanings before we continue:
 P(x → y) is the probability of jumping from x to y: the proposal distribution generates y, and y is accepted
 P(y → x) is the probability of jumping from y to x: the proposal distribution generates x, and x is accepted
 q(y | x) is the probability of generating y
 q(x | y) is the probability of generating x
 r(x, y) is the probability of accepting y
 r(y, x) is the probability of accepting x.
With that in mind, we can show that f(x) P(x → y) = f(y) P(y → x) as follows:
f(x) P(x → y) = f(x) q(y | x) r(x, y) = f(x) q(y | x) · f(y) q(x | y) / [f(x) q(y | x)] = f(y) q(x | y)
f(y) P(y → x) = f(y) q(x | y) r(y, x) = f(y) q(x | y)
Noticing that the right-hand sides of the two equations are equal, we conclude that
f(x) P(x → y) = f(y) P(y → x)
as desired, thus showing that Metropolis-Hastings satisfies detailed balance.
Next lecture we will see that MetropolisHastings is also irreducible and ergodic thus showing that it converges.
Metropolis Hastings Algorithm Continued  June 25
The Metropolis–Hastings algorithm is a Markov chain Monte Carlo method. It is used to sample from probability distributions that are difficult to sample from directly. The algorithm is named after Nicholas Metropolis (1915–1999), also a coauthor of the Simulated Annealing method (introduced in this lecture as well). The Gibbs sampling algorithm, which will be introduced next lecture, is a special case of the Metropolis–Hastings algorithm; it is more efficient, although less generally applicable.
In the last class, we showed that Metropolis-Hastings satisfies the detailed-balance equations, i.e.
f(x) P(x → y) = f(y) P(y → x), which means f is the stationary distribution of the chain.
But this is not enough, we want the chain to converge to the stationary distribution as well.
Thus, we also need it to be:
Irreducible: there is a positive probability of reaching any nonempty set of states from any starting point. This is trivial for many choices of q, including the one we used in the example in the previous lecture (which was normally distributed).
Aperiodic: the chain does not oscillate between different sets of states. In the previous example, the proposal is N(x, b^{2}), which clearly does not oscillate.
Next we discuss a couple of variations of Metropolis Hastings
Random Walk Metropolis Hastings
1. Draw Y = X_{i} + ε, where ε has distribution q; X_{i} is the current state, so Y is going to be close to X_{i}.
2. This means Y ~ q(y - X_{i}). (Note that q is a function of the distance between the current state and the state the chain is going to travel to, i.e. it is of the form q(|y - x|). Hence we know that in this version q is symmetric.)
3. r = min( f(Y)/f(X_{i}) , 1 )
Recall that in our previous example we wanted to sample from the Cauchy distribution and our proposal distribution was N(X_{i-1}, b^{2}).
In MATLAB, we defined this as
Y = b*randn + X(i-1)
(i.e. ε = b*randn ~ N(0, b^{2}))
In this case, we need r = min( f(Y)/f(X_{i-1}) , 1 ).
The hard problem is to choose b so that the chain will mix well.
Rule of thumb: choose b such that the rejection probability is 0.5 (i.e. half the time accept, half the time reject)
Example
If we draw Y with f(Y) ≥ f(X_{i}), then r = 1: accept with probability 1.
If we draw Y with f(Y) < f(X_{i}), then r = f(Y)/f(X_{i}): accept with probability f(Y)/f(X_{i}).
Hence, each point drawn from the proposal that belongs to a region with higher density will be accepted for sure (with probability 1), and if a point belongs to a region with lower density, then the chance that it will be accepted is less than 1.
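The acceptance rate from the rule of thumb can be measured directly. A Python sketch of the random-walk Metropolis chain for the Cauchy target of the earlier example (b = 2 matches the MATLAB code; the seed and the acceptance-rate bounds are illustrative choices):

```python
# Random-walk Metropolis for the Cauchy target, tracking the fraction of
# accepted proposals.
import random

random.seed(1)
b = 2.0
x = 0.0
accepted = 0
N = 20000
for _ in range(N):
    y = x + b * random.gauss(0.0, 1.0)           # proposal N(x, b^2)
    r = min((1.0 + x * x) / (1.0 + y * y), 1.0)  # f(y)/f(x) for the Cauchy
    if random.random() < r:
        x = y
        accepted += 1

rate = accepted / N
assert 0.15 < rate < 0.9  # a roughly reasonable range for this b
```

Making b much larger drives the rate toward 0 (stuck chain); making it much smaller drives the rate toward 1 (slow exploration).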
Independence Metropolis Hastings
In this case, the proposal distribution is independent of the current state, i.e. q(y | x) = q(y).
We draw Y from a fixed distribution q.
And define r = min( [f(Y) q(X_{i})] / [f(X_{i}) q(Y)] , 1 )
This does not work well unless q is very similar to the target distribution f. It is usually used when f is known up to a proportionality constant: the form of the distribution is known, but the distribution itself is not exactly known.
Now, we pose the question: if q does not depend on X_{i}, does it mean that the sequence generated from this chain is really independent?
Answer: even though q(y) does not depend on X_{i}, the acceptance probability r does depend on X_{i}. So the sequence is not really independent, because the acceptance probability depends on the current state.
Simulated Annealing
Simulated Annealing is a method of optimization and an application of the Metropolis Hastings Algorithm.
Consider the problem of finding x such that the objective function h(x) is at its minimum:
min_{x} h(x)
Given a constant T, and since the exponential function is monotone, this optimization problem is equivalent to
max_{x} exp(-h(x)/T)
We consider a distribution function f(x) such that f(x) ∝ exp(-h(x)/T), where T is called the temperature, and define the following procedure:
 Set T to be a large number
 Start with some random X_{0}
 Draw Y ~ q(y | X_{i}) (note that q is usually chosen to be symmetric)
 Define r = min( exp(-h(Y)/T) / exp(-h(X_{i})/T) , 1 ) (when q is symmetric)
 Draw U ~ Uniform(0, 1); X_{i+1} = Y if U < r, otherwise X_{i+1} = X_{i}
 Decrease T and go to Step 2
Now, we know that
r = min( exp(-h(Y)/T) / exp(-h(X_{i})/T) , 1 ) = min( exp( [h(X_{i}) - h(Y)] / T ) , 1 )
Consider ΔH = h(Y) - h(X_{i}).
Now, suppose T is large:
if ΔH ≤ 0, then r = 1 and we therefore accept Y with probability 1
if ΔH > 0, then r = exp(-ΔH/T) ≈ 1 and we therefore accept Y with probability close to 1
On the other hand, suppose T is arbitrarily small (T → 0):
if ΔH ≤ 0, then r = 1 and we therefore accept Y with probability 1
if ΔH > 0, then r = exp(-ΔH/T) ≈ 0 and we therefore reject Y
Example 1
Consider the problem of minimizing the function
We can plot this function and observe where it has a local minimum.
We then plot the graphs of exp(-h(x)/T) for several values of T and observe that the distribution expands for large T and contracts for small T; i.e. T plays the role of a variance, making the distribution expand and contract accordingly.
In the end, T is small and the region we are trying to sample from becomes sharper. The points that we accept are increasingly close to the local max of the exponential function (which is the mode of the distribution), thereby corresponding to the local min of our original function (as can be seen above).
In practice, if T decreases too quickly, the algorithm may get 'stuck' in another nearby local minimum, and we do not get the convergence we are looking for.
Example 2 (from June 30th lecture)
Suppose we want to minimize the function h(x) = (x - 2)^{2}.
Intuitively, we know that the answer is 2. To apply the Simulated Annealing procedure, however, we require a proposal distribution. Suppose we use q(y | x) = N(x, b^{2}) with b = 2, and we begin with X_{1} = 0.
Then the problem may be solved in MATLAB using the following:
function v = obj(x)
v = (x - 2).^2;

T = 10; % this is the initial value of T, which we must gradually decrease
b = 2;
X(1) = 0;
for i = 2:100 % as we change T, we will change i (e.g. i=101:200)
    Y = b*randn + X(i-1);
    r = min(1, exp(-obj(Y)/T)/exp(-obj(X(i-1))/T));
    U = rand;
    if U < r
        X(i) = Y; % accept Y
    else
        X(i) = X(i-1); % reject Y
    end;
end;
The first run (with ) gives us
Next, if we let in the order displayed, then we get the following graph when we plot X:
i.e. it converges to the minimum of the function
Travelling Salesman Problem
This problem consists of finding the shortest path connecting a group of cities. The salesman must visit each city once and come back to the start in the shortest possible circuit. This problem is essentially one of optimization because the goal is to minimize the salesman's cost function (this function consists of the costs associated with travelling between two cities on a given path).
The travelling salesman problem is one of the most intensely investigated problems in computational mathematics and has been researched by many from diverse academic backgrounds including mathematics, CS, chemistry, physics, psychology, etc... Consequently, the travelling salesman problem now has applications in manufacturing, telecommunications, and neuroscience to name a few.^{[2]}
For a good introduction to the travelling salesman problem, along with a description of the theory involved in the problem and examples of its application, refer to a paper by Michael Hahsler and Kurt Hornik entitled Introduction to TSP  Infrastructure for the Travelling Salesman Problem. [2]
The examples are particularly useful because they are implemented using R (a statistical computing software environment).
Gibbs Sampling  June 30, 2009
This algorithm is a specific form of MetropolisHastings and is the most widely used version of the algorithm. It is used to generate a sequence of samples from the joint distribution of multiple random variables. It was first introduced by Geman and Geman (1984) and then further developed by Gelfand and Smith (1990).^{[3]} In order to use Gibbs Sampling, we must know how to sample from the conditional distributions. The point of Gibbs sampling is that given a multivariate distribution, it is simpler to sample from a conditional distribution than to integrate over a joint distribution. Gibbs Sampling also satisfies detailed balance equation, similar to MetropolisHastings
This implies that the chain is reversible. The procedure for proving this balance equation is similar to the Metropolis-Hastings proof.
Advantages
 The algorithm has an acceptance rate of 1. Thus it is efficient because we keep all the points we sample.
 It is useful for highdimensional distributions.
Disadvantages^{[4]}
 We rarely know how to sample from the conditional distributions.
 The algorithm can be extremely slow to converge.
 It is often difficult to know when convergence has occurred.
 The method is not practical when there are strong correlations between the random variables, since the chain then mixes slowly.
Example: Gibbs sampling is used if we want to sample from a multidimensional distribution f(x_{1}, ..., x_{d}).
We can use Gibbs sampling (assuming we know how to draw from the conditional distributions) by drawing each coordinate in turn from its conditional distribution given the current values of all the others,
and the resulting set of points follows the required multivariate distribution.
Suppose we want to sample from a bivariate distribution f(x, y), starting from an initial point (X_{0}, Y_{0}), i = 0.
Furthermore, suppose that we know how to sample from the conditional distributions f(x | y) and f(y | x).
 Step 1: Draw Y_{i+1} ~ f(y | X_{i}) (i.e. given the previous point, sample a new point)
 Step 2: Draw X_{i+1} ~ f(x | Y_{i+1}) (note: it must be Y_{i+1}, not Y_{i}, otherwise detailed balance may not hold)
 Repeat Steps 1 and 2
Note: this method usually takes a long time before convergence, called the "burn-in time". For this reason the distribution is sampled better using only the later points X_{i} rather than all of them.
Example: Suppose we want to generate samples from a bivariate normal distribution with means μ_{1}, μ_{2}, unit variances, and correlation ρ.
Note that for a bivariate normal distribution it may be shown that the conditional distributions are normal: X | Y = y ~ N(μ_{1} + ρ(y - μ_{2}), 1 - ρ^{2}) and Y | X = x ~ N(μ_{2} + ρ(x - μ_{1}), 1 - ρ^{2}).
The problem (for a specified value of ρ) may be solved in MATLAB using the following:
Y = [1 ; 2]; % vector of means
rho = 0.01;
sigma = sqrt(1 - rho^2);
X(1,:) = [0 0];
for i = 2:5000
    mu = Y(1) + rho*(X(i-1,2) - Y(2));
    X(i,1) = mu + sigma*randn;
    mu = Y(2) + rho*(X(i,1) - Y(1)); % uses the newly drawn X(i,1), per the note above
    X(i,2) = mu + sigma*randn;
end;
%plot(X(:,1),X(:,2),'.')               % plots all of the points
%plot(X(1000:end,1),X(1000:end,2),'.') % plots the last 4000 points -- this
%   demonstrates that convergence occurs after a while (the burn-in time)
The output of plotting all points is:
MetropolisHastings within Gibbs Sampling  July 2
Thus far when discussing Gibbs Sampling, it has been assumed that we know how to sample from the conditional distributions. Even if this is not known, it is still possible to use Gibbs Sampling by utilizing the MetropolisHastings algorithm.
 Choose q_{x}(x' | x, y) as a proposal distribution for X (assuming Y fixed).
 Choose q_{y}(y' | x, y) as a proposal distribution for Y (assuming X fixed).
 Do a Metropolis-Hastings step for X, treating Y as fixed.
 Do a Metropolis-Hastings step for Y, treating X as fixed.
The procedure:
 Start with some random (X_{0}, Y_{0})
 Draw X' ~ q_{x}(x' | X_{i}, Y_{i})
 Set X_{i+1} = X' with probability min( [f(X', Y_{i}) q_{x}(X_{i} | X', Y_{i})] / [f(X_{i}, Y_{i}) q_{x}(X' | X_{i}, Y_{i})] , 1 ); otherwise X_{i+1} = X_{i}
 Draw Y' ~ q_{y}(y' | X_{i+1}, Y_{i})
 Set Y_{i+1} = Y' with probability min( [f(X_{i+1}, Y') q_{y}(Y_{i} | X_{i+1}, Y')] / [f(X_{i+1}, Y_{i}) q_{y}(Y' | X_{i+1}, Y_{i})] , 1 ); otherwise Y_{i+1} = Y_{i}
 Set i = i + 1, return to step 2 and repeat the same procedure
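The procedure above can be sketched in a few lines. A Python example under assumed ingredients: the target f(x, y) ∝ exp(-(x² + y² + x²y²)/2) is an illustrative density with no standard conditional distributions, and both proposals are symmetric normal random walks, so each MH ratio reduces to f(new)/f(old):

```python
# Metropolis-Hastings within Gibbs: alternate an MH step in x (y fixed)
# with an MH step in y (x fixed).
import math, random

def f(x, y):
    # unnormalized target density (illustrative choice)
    return math.exp(-(x * x + y * y + x * x * y * y) / 2.0)

random.seed(0)
x, y = 0.0, 0.0
chain = []
for _ in range(5000):
    # MH step for X, treating Y as fixed
    x_prop = x + random.gauss(0.0, 1.0)
    if random.random() < min(f(x_prop, y) / f(x, y), 1.0):
        x = x_prop
    # MH step for Y, treating X as fixed (uses the updated x)
    y_prop = y + random.gauss(0.0, 1.0)
    if random.random() < min(f(x, y_prop) / f(x, y), 1.0):
        y = y_prop
    chain.append((x, y))

mean_x = sum(p[0] for p in chain) / len(chain)
assert abs(mean_x) < 0.3  # the target is symmetric about 0
```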
Page Ranking and Review of Linear Algebra  July 7
Page Ranking
Page Rank is a form of link analysis algorithm, and it is named after Larry Page, who is a computer scientist and is one of the cofounders of Google. As an interesting note, the name "PageRank" is a trademark of Google, and the PageRank process has been patented. However the patent has been assigned to Stanford University instead of Google.
In the real world, the Page Ranking process is used by Internet search engines, namely Google. It assigns a numerical weighting to each web page within the World Wide Web which measures the relative importance of each page. To rank a web page in terms of importance, we look at the number of web pages that link to it. Additionally, we consider the relative importance of the linking web page.
We rank pages based on the weighted number of links to that particular page: a web page is important if many pages point to it.
Factors relating to importance of links
1) Importance (rank) of linking web page (higher importance is better).
2) Number of outgoing links from linking web page (lower is better  since the importance of the original page itself may be diminished if it has a large number of outgoing links).
Definitions
Let L_{ij} = 1 if page j links to page i and 0 otherwise, let c_{j} be the number of outgoing links from page j, and let N be the number of pages. The rank of page i is
P(i) = (1 - d)/N + d Σ_{j} L_{ij} P(j)/c_{j}
Under this formula, P(i) is never zero. We weight the sum and the constant using d (a coefficient between 0 and 1 used to balance the two terms).
In Matrix Form
where P = [P(1), ..., P(N)]^{T} and
A = ((1 - d)/N) · 1 + d L D^{-1}
L and D = diag(c_{1}, ..., c_{N}) are both N × N matrices.
Solving for P
Since rank is a relative term, if we make the assumption that Σ_{i} P(i) = 1,
then we can solve for P (in matrix form this is P = AP).
We can solve for P using two different methods. Firstly, we can recognize that P is an eigenvector corresponding to eigenvalue 1, for matrix A. Secondly, we can recognize that P is the stationary distribution for a transition matrix A.
If we look at this as a Markov Chain, this represents a random walk on the internet. There is a chance of jumping to an unlinked page (from the constant) and the probability of going to a page increases as the number of links to it increases.
To solve for P, we start with a random guess P_{0} and repeatedly apply P ← AP.
Since the chain is stationary, for large n, A^{n} P_{0} converges to P.
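This power iteration is easy to sketch. A Python version (the 4-page link structure and d = 0.85 match the MATLAB example later in this section; it assumes every page has at least one outgoing link):

```python
# PageRank by power iteration: p <- (1-d)/N + d * L * D^{-1} * p.
def pagerank(L, d=0.85, iters=100):
    n = len(L)
    # outgoing-link counts = column sums of L (L[i][j] = 1 if j links to i)
    C = [sum(L[i][j] for i in range(n)) for j in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [(1 - d) / n + d * sum(L[i][j] * p[j] / C[j] for j in range(n))
             for i in range(n)]
    return p

L = [[0, 0, 1, 0],
     [1, 0, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 0, 0]]
p = pagerank(L)
assert abs(sum(p) - 1.0) < 1e-9
assert p[2] > p[0] > p[1] > p[3]  # page 3 ranked highest, page 4 lowest
```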
Example of using Page Ranking:
Suppose that the links between four webpages are:
1 → 2
↕ ↙
3 ← 4
According to the factors relating to importance of links, we can consider two possible rankings :
or
if we consider that the high importance of the link from 3 to 1 is more influential than the fact that there are two outgoing links from page 1 and only one from page 2.
To rank our pages, we use the formula:
P = AP, where A = ((1 - d)/N) · 1 + d L D^{-1}
In this example:
1) The link matrix is L = [0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0], where L_{ij} = 1 if page j links to page i
2) C = [2 1 1 1], the number of outgoing links from each page
3) We have D = diag(C), so:
Algorithm on matlab:
L=[0 0 1 0; 1 0 0 0; 1 1 0 1; 0 0 0 0]
C=[2 1 1 1]
D=diag(C)
d=0.85
A=(1-d)*ones(4)/4+d*L*inv(D)
[vec val]=eig(A)
We are now going to choose a random vector p:
p=rand(1,4)
p=p/sum(p)
p=p'
p=A*p
p=A*p
etc...
p=A*p
p =
0.3725 0.1958 0.3942 0.0375
So we conclude that Pr(page 3) > Pr(page 1) > Pr(page 2) > Pr(page 4): page three is ranked highest.
Linear Algebra Review
Inner Product^{[5]}
Note that the inner product is also referred to as the dot product. If u = (u_{1}, ..., u_{n}) and v = (v_{1}, ..., v_{n}), then the inner product is:
u · v = u^{T}v = Σ_{i} u_{i} v_{i}
The length (or norm) of u is the nonnegative scalar defined by
||u|| = √(u · u)
For u and v in R^{n}, the distance between u and v, written as dist(u, v), is the length of the vector u - v. That is,
dist(u, v) = ||u - v||
If u and v are nonzero vectors in R^{2} or R^{3}, then the angle θ between u and v is given by
cos θ = (u · v) / (||u|| ||v||)
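The definitions above, computed for two small vectors in R^3 (the vectors are arbitrary illustrative choices):

```python
# Inner product, norm, distance, and angle between two vectors.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

u = [1.0, 2.0, 2.0]
v = [2.0, 0.0, 0.0]

length_u = norm(u)                          # sqrt(1 + 4 + 4) = 3
dist = norm([a - b for a, b in zip(u, v)])  # ||u - v||
angle = math.acos(dot(u, v) / (norm(u) * norm(v)))

assert length_u == 3.0
assert abs(dist - 3.0) < 1e-12              # u - v = (-1, 2, 2), length 3
assert abs(math.cos(angle) - 1 / 3) < 1e-12  # u.v / (|u||v|) = 2 / 6
```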
Orthogonal and Orthonormal^{[6]}
Two vectors u and v in R^{n} are orthogonal (to each other) if u · v = 0.
By the Pythagorean Theorem, it may also be said that two vectors u and v are orthogonal if and only if ||u + v||^{2} = ||u||^{2} + ||v||^{2}.
Two vectors u and v in R^{n} are orthonormal if u · v = 0 and ||u|| = ||v|| = 1.
An orthonormal matrix is a square invertible matrix U such that U^{-1} = U^{T}, or alternatively U U^{T} = U^{T} U = I.
Note that such a matrix is also commonly called an orthogonal matrix; the two terms are used interchangeably here.
Dependence and Independence^{[7]}
The set of vectors {x_{1}, ..., x_{p}} in R^{n} is said to be linearly independent if the vector equation c_{1}x_{1} + c_{2}x_{2} + ... + c_{p}x_{p} = 0 has only the trivial solution (i.e. c_{1} = c_{2} = ... = c_{p} = 0).
The set of vectors is said to be linearly dependent if there exists a set of coefficients (not all zero) such that c_{1}x_{1} + c_{2}x_{2} + ... + c_{p}x_{p} = 0.
If a set contains more vectors than there are entries in each vector, then the set is linearly dependent.
That is, any set of p vectors in R^{n} is linearly dependent if p > n.
If a vector set in contains the zero vector, then the set is linearly dependent.
Trace and Rank^{[8]}
The trace of a square matrix A, denoted by tr(A), is the sum of the diagonal entries in A. That is,
tr(A) = Σ_{i} a_{ii}
Note that an alternate definition of the trace is
tr(A) = Σ_{i} λ_{i}
i.e. it is the sum of all the eigenvalues of the matrix.
The rank of a matrix A, denoted by rank(A), is the dimension of the column space of A. That is, the rank of a matrix is the number of linearly independent rows (or columns) of A.
A square matrix is nonsingular if and only if its rank equals the number of rows (or columns); equivalently, if and only if its determinant is nonzero. A nonsingular matrix is invertible; a matrix that is not invertible is called a singular matrix.
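The two definitions of the trace can be checked on a small example. A Python sketch (the 2×2 matrix is an illustrative choice; its eigenvalues are found from the characteristic equation det(A − λI) = 0):

```python
# Check that the sum of the diagonal entries equals the sum of the eigenvalues.
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]

trace = A[0][0] + A[1][1]

# characteristic equation: lambda^2 - tr(A) lambda + det(A) = 0
b_coef = -(A[0][0] + A[1][1])
c_coef = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(b_coef ** 2 - 4 * c_coef)
lam1 = (-b_coef + disc) / 2   # = 3
lam2 = (-b_coef - disc) / 2   # = 1

assert abs(trace - (lam1 + lam2)) < 1e-12
assert {lam1, lam2} == {3.0, 1.0}
```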
A square matrix is said to be orthogonal if AA^{T} = A^{T}A = I.
For a square matrix A:
 if x^{T}Ax > 0 for all x ≠ 0, then A is said to be positive-definite.
 if x^{T}Ax ≥ 0 for all x, then A is said to be positive-semidefinite.
The inverse of a square matrix A is denoted by A^{ − 1} and is such that AA^{ − 1} = A^{ − 1}A = I. The inverse of a matrix A exists if and only if A is nonsingular.
The pseudo-inverse matrix A^{+} = (A^{T}A)^{-1}A^{T} is typically used whenever A^{-1} does not exist because A is either not square or singular; it satisfies A^{+}A = I.
Vector Spaces
The ndimensional space in which all the ndimensional vectors reside is called a vector space.
A set of vectors {u_{1},u_{2},u_{3},...u_{n}} is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination of the {u_{i}}: x = a_{1}u_{1} + a_{2}u_{2} + ... + a_{n}u_{n}
 The coefficients {a_{1},a_{2},...a_{n}} are called the components of vector x with the basis {u_{i}}.
 In order to form a basis, it is necessary and sufficient that the {u_{i}} vectors be linearly independent.
A basis {u_{i}} is said to be orthogonal if u_{i} · u_{j} = 0 for all i ≠ j.
A basis {u_{i}} is said to be orthonormal if u_{i} · u_{j} = 0 for all i ≠ j and ||u_{i}|| = 1 for all i.
Eigenvectors and Eigenvalues
Given a matrix A_{NxN}, we say that v ≠ 0 is an eigenvector of A if there exists a scalar λ (the corresponding eigenvalue) such that Av = λv.
Computation of eigenvalues: the eigenvalues are the roots of the characteristic equation det(A - λI) = 0.
Properties
 If A is nonsingular all eigenvalues are nonzero.
 If A is real and symmetric, all eigenvalues are real and the associated eigenvectors are orthogonal.
 If A is positivedefinite all eigenvalues are positive
Linear Transformations
A linear transformation is a mapping from a vector space X^{N} to a vector space Y^{M}, and it is represented by an M × N matrix A.
 Given a vector x in X^{N}, the corresponding vector y in Y^{M} is computed as y = Ax.
 The dimensionality of the two spaces does not have to be the same (M and N do not have to be equal).
A linear transformation represented by a square matrix A is said to be orthonormal when AA^{T} = A^{T}A = I
 implies that A^{T} = A^{ − 1}
 An orthonormal transformation has the property of preserving the magnitude of the vectors: ||Ax|| = ||x||.
 An orthonormal matrix can be thought of as a rotation of the reference frame
 The row vectors of an orthonormal transformation form a set of orthonormal basis vectors.
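A rotation matrix is the standard concrete instance of the properties above: A^T A = I and ||Ax|| = ||x||. A pure-Python check (the rotation angle is an arbitrary illustrative choice):

```python
# Verify that a 2-D rotation is orthonormal and preserves vector magnitude.
import math

t = 0.7  # rotation angle in radians
A = [[math.cos(t), -math.sin(t)],
     [math.sin(t),  math.cos(t)]]

# A^T A should be the identity
AtA = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)]
       for i in range(2)]
assert all(abs(AtA[i][j] - (1.0 if i == j else 0.0)) < 1e-12
           for i in range(2) for j in range(2))

# the magnitude of x = (3, 4) is preserved
x = [3.0, 4.0]
Ax = [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
assert abs(math.hypot(*Ax) - 5.0) < 1e-12
```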
Interpretation of Eigenvalues and Eigenvectors
If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space.
 When transformed by A, any point lying on the direction defined by v will remain on that direction and its magnitude will be multiplied by the corresponding eigenvalue.
Given the covariance matrix Σ of a Gaussian distribution:
 The eigenvectors of Σ are the principal directions of the distribution
 The eigenvalues are the variances of the corresponding principal directions
The linear transformation defined by the eigenvectors of Σ leads to vectors that are uncorrelated regardless of the form of the distribution (this is used in Principal Component Analysis).
 If the distribution is Gaussian, then the transformed vectors will be statistically independent.
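The decorrelation property can be demonstrated directly: projecting data onto the eigenvectors of its sample covariance matrix yields uncorrelated coordinates. A Python sketch (the data set and its correlation structure are illustrative; for a 2×2 symmetric matrix the eigenbasis is a rotation by the angle below):

```python
# Rotate correlated 2-d data into the eigenbasis of its covariance matrix
# and check that the off-diagonal covariance vanishes.
import math, random

random.seed(2)
data = []
for _ in range(4000):
    x = random.gauss(0.0, 1.0)
    y = x + 0.5 * random.gauss(0.0, 1.0)  # y correlated with x
    data.append((x, y))

def cov(pts):
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    sxx = sum((p[0] - mx) ** 2 for p in pts) / n
    syy = sum((p[1] - my) ** 2 for p in pts) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return sxx, sxy, syy

sxx, sxy, syy = cov(data)
assert sxy > 0.5  # clearly correlated

# eigen-rotation angle of the 2x2 symmetric covariance matrix
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
c, s = math.cos(theta), math.sin(theta)
rotated = [(c * x + s * y, -s * x + c * y) for x, y in data]
_, rxy, _ = cov(rotated)
assert abs(rxy) < 1e-6  # uncorrelated in the eigenbasis
```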
Principal Component Analysis  July 9
Principal Component Analysis (PCA) is a powerful technique for reducing the dimensionality of a data set. It has applications in data visualization, data mining, classification, etc. It is mostly used for data analysis and for making predictive models.
Rough definition
Keeping two important aspects of data analysis in mind:
 Reducing covariance in data
 Preserving the information stored in the data (variance is a source of information)
PCA takes a sample of d-dimensional vectors and produces an orthogonal (zero covariance) set of d 'Principal Components'. The first Principal Component is the direction of greatest variance in the sample. The second Principal Component is the direction of second greatest variance (orthogonal to the first component), and so on.
Then we can preserve most of the variance in the sample in a lower dimension by choosing the first k Principal Components and approximating the data in k-dimensional space, which is easier to analyze and plot.
Principal Components of handwritten digits
Suppose that we have a set of 103 images (28 by 23 pixels) of handwritten threes, similar to the assignment dataset.
We can represent each image as a vector of length 644 (28 × 23 = 644), just like in assignment 5. Then we can represent the entire data set as a 644 by 103 matrix, shown below. Each column represents one image (644 rows = 644 pixels).
Using PCA, we can approximate the data as the product of two smaller matrices, which I will call V and W. If we expand the matrix product, then each image is approximated by a linear combination of the columns of V, with coefficients λ = [λ_{1}, λ_{2}]^{T} given by a column of W.
Don't worry about the constant term for now. The point is that we can represent an image using just 2 coefficients instead of 644. Also notice that the coefficients correspond to features of the handwritten digits. The picture below shows the first two principal components for the set of handwritten threes.
The first coefficient represents the width of the entire digit, and the second coefficient represents the thickness of the stroke.
More examples
The slides cover several examples. Some of them use PCA, others use similar, more sophisticated techniques outside the scope of this course (see Nonlinear dimensionality reduction).
 Handwritten digits.
 Recognition of hand orientation. (Isomap??)
 Recognition of facial expressions. (LLE  Locally Linear Embedding?)
 Arranging words based on semantic meaning.
 Detecting beards and glasses on faces. (MDS  Multidimensional scaling?)
Derivation of the first Principal Component
We want to find the direction of maximum variation. Let w be an arbitrary direction, x a data point, and u = w^{T}x the length of the projection of x in direction w.
The direction of cw is the same as that of w, so without loss of generality we take ||w|| = 1.
Let x_{1}, ..., x_{n} be random variables; then our goal is to maximize the variance of u,
Var(u) = Var(w^{T}x) = w^{T}Σw
For a finite data set we replace the covariance matrix Σ by s, the sample covariance matrix,
is the variance of any vector , formed by the weight vector . The first principal component is the vector that maximizes the variance,
where arg max denotes the value of w that maximizes the function. Our goal is to find the weights that maximize this variability, subject to a constraint.
The problem then becomes,
max_{w} w^{T}Sw
such that w^{T}w = 1.
Notice, w^{T}Sw is a continuous function of w on the unit sphere, which is a closed and bounded set.
Therefore the variance is bounded, so the maximum exists. We find this maximum using the method of Lagrange multipliers.
Principal Component Analysis Continued  July 14
From the previous lecture, we have seen that to find the direction of maximum variance, the problem becomes: max_{w} w^{T}Sw with constraint w^{T}w = 1.
Before we can proceed, we must review Lagrange Multipliers.
Lagrange Multiplier
To find the maximum (or minimum) of a function f(x, y) subject to a constraint g(x, y) = c, we define a new variable λ called a Lagrange multiplier and we form the Lagrangian,
L(x, y, λ) = f(x, y) − λ(g(x, y) − c).
If (x*, y*) is the max of f subject to the constraint, there exists λ* such that (x*, y*, λ*) is a stationary point of L (all partial derivatives are 0).
In addition, (x*, y*) is a point at which f and g touch but do not cross. At this point, the tangents of f and g are parallel, or the gradients of f and g are parallel, such that:
∇f = λ∇g
where,
∇f = (∂f/∂x, ∂f/∂y), the gradient of f,
∇g = (∂g/∂x, ∂g/∂y), the gradient of g.
Example
Suppose we wish to maximise a function f(x, y) subject to a constraint g(x, y) = c. We can apply the Lagrange multiplier method to this example; the Lagrangian is:
L(x, y, λ) = f(x, y) − λ(g(x, y) − c).
We want the partial derivatives equal to zero:
∂L/∂x = 0, ∂L/∂y = 0, ∂L/∂λ = 0.
Solving the system, we obtain two stationary points. In order to determine which one is the maximum, we substitute each into f and see which one has the biggest value.
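As a concrete illustration of the method (this example is ours, chosen for simplicity, and need not match the one worked in class), take f(x, y) = x + y subject to g(x, y) = x^{2} + y^{2} = 1:

```latex
\begin{aligned}
L(x, y, \lambda) &= x + y - \lambda\,(x^2 + y^2 - 1)\\
\frac{\partial L}{\partial x} &= 1 - 2\lambda x = 0 \;\Rightarrow\; x = \frac{1}{2\lambda}\\
\frac{\partial L}{\partial y} &= 1 - 2\lambda y = 0 \;\Rightarrow\; y = \frac{1}{2\lambda}\\
\frac{\partial L}{\partial \lambda} &= -(x^2 + y^2 - 1) = 0 \;\Rightarrow\; \frac{1}{2\lambda^2} = 1 \;\Rightarrow\; \lambda = \pm\frac{1}{\sqrt{2}}
\end{aligned}
```

The stationary points are (√2/2, √2/2) with f = √2 and (−√2/2, −√2/2) with f = −√2; substituting into f shows the first is the maximum.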
Determining W
Back to the original problem: from the Lagrangian we obtain,
L(w, λ) = w^{T}Sw − λ(w^{T}w − 1).
If w is a unit vector, then the second term of L is 0.
If w is not a unit vector, then the second term is nonzero and decreases the overall value of L. Maximization happens when w^{T}w = 1.
(Note that to take the derivative with respect to w below, w^{T}Sw can be thought of as a quadratic function of w, hence the 2Sw below. For more matrix derivatives, see section 2 of the Matrix Cookbook.)
Taking the derivative with respect to w, we get:
∂L/∂w = 2Sw − 2λw.
Setting ∂L/∂w = 0, we get
Sw = λw.
This is the eigenvalue equation: w is an eigenvector of S, and λ is the corresponding eigenvalue of S. If we substitute Sw = λw into w^{T}Sw we obtain,
w^{T}Sw = w^{T}λw = λw^{T}w = λ.
In order to maximize the objective function, we choose the eigenvector corresponding to the largest eigenvalue. We choose the first PC, u_{1}, to have the maximum variance
(i.e. capturing as much variability in x as possible). Subsequent principal components will take up successively smaller parts of the total variability.
D-dimensional data will have D eigenvectors,
λ_{1} ≥ λ_{2} ≥ … ≥ λ_{D},
where each λ_{i} represents the amount of variation in direction w_{i},
so that
Var(w_{1}^{T}x) ≥ Var(w_{2}^{T}x) ≥ … ≥ Var(w_{D}^{T}x).
Note that the principal components decompose the total variance in the data:
Σ_{i=1}^{D} λ_{i} = Tr(S) = Σ_{i=1}^{D} Var(x_{i}),
i.e. the sum of variations in all directions is the variation in the whole data.
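A quick numerical check of these facts, sketched in Python/NumPy rather than the course's Matlab (the data here is synthetic): the top eigenvector of the sample covariance attains the maximum projected variance, and the eigenvalues sum to the trace of S.

```python
import numpy as np

# Numerical check (synthetic data): the top eigenvector of the sample
# covariance S maximizes the projected variance, and the eigenvalues of S
# decompose the total variance (the trace of S).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.0, 0.5]])  # correlated 2-D data
S = np.cov(X, rowvar=False)          # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S) # eigenvalues in ascending order
w1 = eigvecs[:, -1]                  # first principal component direction
lam1 = eigvals[-1]                   # largest eigenvalue

proj_var = np.var(X @ w1, ddof=1)    # variance of the projected data
print(proj_var, lam1)                # equal: w1' S w1 = lambda_1
print(eigvals.sum(), np.trace(S))    # total variance is preserved
```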
Example from class
We apply PCA to the noisy data, making the assumption that the intrinsic dimensionality of the data is 10. We now try to compute the reconstructed images using the top 10 eigenvectors, and plot the original and reconstructed images.
The Matlab code is as follows:
load('C:\Documents and Settings\r2malik\Desktop\STAT 341\noisy.mat')
who
size(X)
imagesc(reshape(X(:,1),20,28)')    % display the first noisy image
colormap gray
[u s v] = svd(X);
xHat = u(:,1:10)*s(1:10,1:10)*v(:,1:10)';  % use ten principal components
figure
imagesc(reshape(xHat(:,1000),20,28)')  % here '1000' can be changed to different values, e.g. 105, 500, etc.
colormap gray
Running the above code gives us 2 images. The first one represents the noisy data; we can barely make out the face. The second one is the denoised image.
As you can clearly see, more features can be distinguished in the picture of the denoised face than in the picture of the noisy face. If we took more principal components, the image would at first improve, since the intrinsic dimensionality is probably more than 10. But if we include all the components, we get the noisy image back, so not every principal component improves the image. In general, it is difficult to choose the optimal number of components.
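One common heuristic for picking the number of components is the cumulative proportion of variance explained by the leading eigenvalues (equivalently, squared singular values). A sketch in Python/NumPy, on synthetic data with a known intrinsic dimensionality standing in for the noisy-faces file:

```python
import numpy as np

# Sketch (synthetic data with intrinsic dimensionality 5, standing in for the
# noisy-faces file): the cumulative explained variance suggests how many
# components to keep.
rng = np.random.default_rng(2)
signal = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 60))  # rank-5 signal
X = signal + 0.1 * rng.normal(size=(100, 60))                  # plus noise

U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
explained = np.cumsum(s**2) / np.sum(s**2)

k = int(np.searchsorted(explained, 0.95)) + 1   # smallest k reaching 95%
print(k)   # roughly the intrinsic dimensionality (5 here)
```

The 95% threshold is itself a judgment call; the point is that the explained-variance curve flattens once the remaining components are mostly noise.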
Principal Component Analysis (continued)  July 16
Application of PCA  Feature Abstraction
One of the applications of PCA is to group similar data (e.g. images). There are generally two ways to do this. We can classify the data (i.e. give each data point a label and compare different types of data) or cluster it (i.e. leave the data unlabeled and group it into classes based on the output).
Generally speaking, we could do this with the entire data set (if we have an 8 X 8 picture, we can use all 64 pixels). However, this is hard, and it is easier to use the reduced data and features of the data.
Example: Comparing Images of 2s and 3s
To demonstrate this process, we can compare the images of 2s and 3s, from the same data set we have been using throughout the course. We will apply PCA to the data, and compare the images of the labeled data. This is an example of classification.
The matlab code is as follows.
load 2_3    % the size of this file is 64 X 400
[coefs, scores] = princomp(X');
% performs principal components analysis on the data matrix
% returns the principal component coefficients and scores
% scores is the low-dimensional representation of the data X
plot(scores(:,1),scores(:,2))
% plots the most variant dimension on the x-axis
% and the second most variant on the y-axis
plot(scores(1:200,1),scores(1:200,2))
% same graph as above, only with the 2s (not 3s)
hold on    % this command allows us to add to the current plot
plot(scores(201:400,1),scores(201:400,2),'ro')
% this adds the data for the 3s
% the 'ro' option makes them red Os on the plot
% If we classify based on the position in this plot (feature),
% it's easier than looking at each of the 64 data pieces
gname()    % displays a figure window and
           % waits for you to press a mouse button or a keyboard key
figure
subplot(1,2,1)
imagesc(reshape(X(:,45),8,8)')
subplot(1,2,2)
imagesc(reshape(X(:,93),8,8)')
General PCA Algorithm
The PCA Algorithm is summarized in the following slide (taken from the Lecture Slides).
Other Notes:
 The mean of the data (X) must be 0. This means we may have to preprocess the data by subtracting off the mean (see PCA in Wikipedia for details).
 Encoding the data means that we are projecting the data onto a lower-dimensional subspace by taking the inner product. Encoding: Y = U_{d}^{T}X, where U_{d} holds the top d eigenvectors of the covariance matrix.
 When we reconstruct the training set, we are only using the top d dimensions. This will eliminate the dimensions that have lower variance (e.g. noise). Reconstructing: X̂ = U_{d}Y.
 We can compare the reconstructed test sample to the reconstructed training sample to classify the new data.
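The steps above can be sketched end-to-end. This is in Python/NumPy rather than Matlab; the synthetic data and the choice d = 10 are illustrative assumptions.

```python
import numpy as np

# End-to-end sketch of the slide's steps (synthetic data; d = 10 is assumed):
# centre, encode with the top-d eigenvectors, then reconstruct.
rng = np.random.default_rng(3)
X = rng.normal(size=(64, 200))           # 200 samples, 64 "pixels" each

mu = X.mean(axis=1, keepdims=True)       # the data must have mean 0:
Xc = X - mu                              # subtract the mean first

S = Xc @ Xc.T / (Xc.shape[1] - 1)        # 64 x 64 sample covariance
eigvals, U = np.linalg.eigh(S)
U_d = U[:, -10:]                         # top d = 10 eigenvectors

Y = U_d.T @ Xc                           # encode: project to d dimensions
X_hat = U_d @ Y + mu                     # reconstruct in the original space
print(Y.shape, X_hat.shape)              # (10, 200) (64, 200)
```

A new test point would be centred with the same mu, encoded with the same U_d, and then compared to the reconstructed training samples.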
Fisher's Linear Discriminant Analysis (FDA)  July 16(cont)
Similar to PCA, the goal of FDA is to project the data into a lower dimension. The difference is that we are not interested in maximizing variance. Rather, our goal is to find a direction that is useful for classifying the data (i.e. in this case, we are looking for a direction representative of a particular characteristic, e.g. glasses vs. no-glasses).
The number of dimensions that we want to reduce the data to, depends on the number of classes:
For a 2-class problem, we want to reduce the data to one dimension (a line).
Generally, for a k-class problem, we want k - 1 dimensions.
As we will see from our objective function, we want to maximize the separation of the classes and to minimize the within variance of each class. That is, our ideal situation is that the individual classes are as far away from each other as possible, but the data within each class is close together (i.e. collapse to a single point).
The following diagram summarizes this goal.
In fact, the two examples above may represent the same data projected on two different lines.
Goal: Maximum Separation
1. Minimize the within-class variance,
min_{w} (w^{T}Σ_{1}w + w^{T}Σ_{2}w),
and this problem reduces to
min_{w} w^{T}(Σ_{1} + Σ_{2})w
(where Σ_{1} and Σ_{2} are the covariance matrices for the 1st and 2nd class of data respectively).
Let s_{w} = Σ_{1} + Σ_{2} be the within-classes covariance. Then this problem can be rewritten as
min_{w} w^{T}s_{w}w.
2. Maximize the distance between the means of the projected data,
max_{w} |w^{T}μ_{1} − w^{T}μ_{2}|^{2}.
The optimization problem we want to solve is,
max_{w} (w^{T}(μ_{1} − μ_{2}))(w^{T}(μ_{1} − μ_{2}))^{T},
since w^{T}(μ_{1} − μ_{2}) is a scalar. Therefore,
max_{w} w^{T}(μ_{1} − μ_{2})(μ_{1} − μ_{2})^{T}w
(using the property of the transpose, (AB)^{T} = B^{T}A^{T}).
For a two-class problem the between-class variance is,
s_{B} = (μ_{1} − μ_{2})(μ_{1} − μ_{2})^{T}.
Then this problem can be rewritten as,
max_{w} w^{T}s_{B}w.
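The two matrices can be assembled directly from data. A sketch in Python/NumPy (synthetic two-class data, using the same means and covariance as the Matlab example further down in these notes):

```python
import numpy as np

# Sketch: building s_w and s_B from two classes of synthetic data.
rng = np.random.default_rng(4)
Sigma = [[1, 1.5], [1.5, 3]]
X1 = rng.multivariate_normal([1, 1], Sigma, size=300)
X2 = rng.multivariate_normal([5, 3], Sigma, size=300)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
s_w = np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False)   # within-classes covariance
s_B = np.outer(mu1 - mu2, mu1 - mu2)                        # between-classes covariance

# Fisher direction for the two-class problem: w proportional to s_w^{-1}(mu2 - mu1)
w = np.linalg.solve(s_w, mu2 - mu1)
print(s_w.shape, s_B.shape)   # (2, 2) (2, 2)
```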
Objective Function
We want an objective function which satisfies both of the goals outlined above (at the same time):
 max_{w} w^{T}s_{B}w, and
 min_{w} w^{T}s_{w}w (or, equivalently, max_{w} 1 / (w^{T}s_{w}w)).
We take the ratio of the two; we wish to maximize
max_{w} (w^{T}s_{B}w) / (w^{T}s_{w}w).
This is a very famous problem, called the generalized eigenvector problem. We can solve this using Lagrange multipliers. Since w is a directional vector, we do not care about the magnitude of w. Therefore we solve a problem similar to that in PCA,
max_{w} w^{T}s_{B}w
subject to w^{T}s_{w}w = 1.
We solve the following Lagrange multiplier problem,
L(w, λ) = w^{T}s_{B}w − λ(w^{T}s_{w}w − 1).
Continuation of Fisher's Linear Discriminant Analysis (FDA)  July 21
As discussed in the previous lecture, our optimization problem for FDA is:
max_{w} (w^{T}s_{B}w) / (w^{T}s_{w}w),
which we turned into
max_{w} w^{T}s_{B}w
subject to: w^{T}s_{w}w = 1,
where s_{B} is the covariance matrix between classes and s_{w} is the covariance matrix within classes.
Using Lagrange multipliers, we have a partial solution:
s_{B}w = λs_{w}w, i.e. s_{w}^{-1}s_{B}w = λw.
 The optimal solution for w is the eigenvector of s_{w}^{-1}s_{B} corresponding to the largest eigenvalue;
 For a k-class problem, we take the eigenvectors corresponding to the (k - 1) largest eigenvalues;
 In the case of a two-class problem, the optimal solution for w can be simplified, such that:
w ∝ s_{w}^{-1}(μ_{2} − μ_{1}).
Example:
We now see an application of the theory that we just introduced. Using Matlab, we find the principal components and the Fisher discriminant projection of two bivariate normal samples, to show the difference between the two methods.
% First of all, we generate the two data sets:
X1 = mvnrnd([1,1],[1 1.5; 1.5 3], 300);
% In this case: mu_1 = [1;1]; Sigma_1 = [1 1.5; 1.5 3], where mu and Sigma are the mean and covariance matrix.
X2 = mvnrnd([5,3],[1 1.5; 1.5 3], 300);
% Here mu_2 = [5;3] and Sigma_2 = [1 1.5; 1.5 3]
% The plot of the two distributions is:
% We compute the principal components:
X = [X1; X2]';            % 2 x 600: each column is one data point
[coefs, scores] = princomp(X');
coefs(:,1)                % first principal component
% ans =
%    0.76355476446932
%    0.64574307712603
plot([0 coefs(1,1)], [0 coefs(2,1)],'b')
plot([0 coefs(1,1)]*10, [0 coefs(2,1)]*10,'r')
sw = 2*[1 1.5; 1.5 3]     % sw = Sigma1 + Sigma2 = 2*Sigma1
w = sw\[4; 2]             % calculate s_w^{-1}(mu2 - mu1) = s_w^{-1}([5;3] - [1;1])
plot([0 w(1)], [0 w(2)],'g')
% We now make the projection:
Xf = w'*X;
figure
plot(Xf(1:300),1,'ob')
% since the projected data is one-dimensional, we plot it against a constant
hold on
plot(Xf(301:600),1,'or')
% We see in the above picture that there is no overlapping
Xp = coefs(:,1)'*X;
figure
plot(Xp(1:300),1,'b')
hold on
plot(Xp(301:600),2,'or')
% In this case there is overlap, since we project the data onto the first principal component [Xp = coefs(:,1)'*X]
Classification  July 21 (cont)
The process of classification involves predicting a discrete random variable from another (not necessarily discrete) random variable. For instance, we might wish to classify an image as a chair, a desk, or a person. The discrete random variable, Y, is drawn from the set {'chair', 'desk', 'person'}, while the random variable X is the image. (In the case in which Y is a continuous variable, the prediction problem is regression rather than classification.)
Consider independent and identically distributed data points (x_{1}, y_{1}), …, (x_{n}, y_{n}), where x_{i} ∈ X and y_{i} ∈ Y, and Y is a finite set of discrete values. The set of data points is called the training set.
Then h : X → Y is a classifier where, given a new data point x, h predicts its label y. The function h is found using the training set, i.e. the training set trains h to map X to Y.
To continue with the example before, given a training set of images displaying desks, chairs, and people, the function should be able to read a new image never before seen and predict what the image displays out of the three above options, within a margin of error.
Error Rate
Given n test points, how can we find the empirical error rate?
We simply count the number of points that have been misclassified and divide by the total number of points.
Error rate is also called misclassification rate, and 1 minus the error rate is sometimes called the classification rate.
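A minimal sketch of the computation (Python; the labels here are made up for illustration):

```python
# Minimal sketch: the empirical error rate is the fraction of test points
# that the classifier labels incorrectly (labels here are made up).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # hypothetical classifier output

errors = sum(t != p for t, p in zip(y_true, y_pred))
error_rate = errors / len(y_true)
print(error_rate)   # 2 of 8 points misclassified -> 0.25
```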
Bayes Classification Rule
Consider the case of a two-class problem, where Y ∈ {0, 1}. Define
r(x) = P(Y = 1 | X = x) = P(X = x | Y = 1)P(Y = 1) / P(X = x),
where the denominator can be written as
P(X = x) = P(X = x | Y = 1)P(Y = 1) + P(X = x | Y = 0)P(Y = 0).
So our classifier function is
h(x) = 1 if r(x) > 1/2, and h(x) = 0 otherwise.
This function is considered to be the best classifier in terms of error rate. A problem is that we do not know the joint and marginal probability distributions needed to calculate r(x). A real-life interpretation of the marginal may even deal with patterns and meaning, which provides an extra challenge in finding a mathematical interpretation.
In the image, which set would it be more appropriate for the question mark to belong to?
This function is viewed as a theoretical bound  the best that can be achieved by various classification techniques. One such technique is finding the Decision Boundary.
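In the idealized case where the class densities are known, the rule can be written out directly. A sketch in Python (the 1-D Gaussian class-conditional densities and equal priors are our illustrative assumptions; in practice these distributions are unknown):

```python
import math

# Idealized sketch: two classes with known 1-D Gaussian class-conditional
# densities and equal priors (assumptions for illustration only).
def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

pi0, pi1 = 0.5, 0.5                   # priors P(Y = 0), P(Y = 1)

def r(x):                             # posterior P(Y = 1 | X = x)
    f0, f1 = gauss(x, 0.0, 1.0), gauss(x, 2.0, 1.0)
    return f1 * pi1 / (f1 * pi1 + f0 * pi0)

def h(x):                             # Bayes classifier: class 1 iff r(x) > 1/2
    return 1 if r(x) > 0.5 else 0

print(h(0.0), h(2.0))                 # 0 1  (the decision boundary is x = 1)
```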
Decision Boundary
The decision boundary is given by:
D(h) = {x : P(Y = 1 | X = x) = P(Y = 0 | X = x)},
i.e. those points where the probabilities of being in each class are identical. Thus,
P(Y = 1 | X = x) = P(Y = 0 | X = x) = 1/2.
Linear discriminant analysis has a decision boundary represented by a linear function, while quadratic discriminant analysis has a decision boundary represented by a quadratic function.
Linear Discriminant Analysis (LDA)  July 23
Motivation
We would like to apply the Bayes classification rule by approximating the class-conditional density
f_{k}(x) = P(X = x | Y = k)
and the prior
π_{k} = P(Y = k).
By making the following assumptions we can find a linear approximation to the boundary given by Bayes rule,
 The class-conditional density is multivariate Gaussian;
 The classes have a common covariance matrix, Σ_{k} = Σ for all k.
Derivation
 Note on Quadratic Form
By assumption (1),
f_{k}(x) = (1 / ((2π)^{d/2}|Σ_{k}|^{1/2})) exp(−(1/2)(x − μ_{k})^{T}Σ_{k}^{-1}(x − μ_{k})),
where Σ_{k} is the class covariance matrix and μ_{k} is the class mean. By definition of the decision boundary (the decision boundary between class k and class l),
f_{k}(x)π_{k} = f_{l}(x)π_{l}.
By assumption (2), Σ_{k} = Σ_{l} = Σ, so the quadratic terms (1/2)x^{T}Σ^{-1}x cancel when we take the logarithm of both sides.
The result is a linear function of x of the form a^{T}x + b = 0.
Example:
% Load data set
load 2_3;
[coefs, scores] = princomp(X');
size(X)       % ans = 64 400  (64 principal components)
size(coefs)   % ans = 64 64
size(scores)  % ans = 400 64
Y = scores(:, 1:2);   % just use two of the 64 principal components
plot(Y(1:200, 1), Y(1:200, 2), 'b.')
hold on
plot(Y(201:400, 1), Y(201:400, 2), 'r.')
ll = [zeros(200,1); ones(200,1)];   % class labels: 200 twos followed by 200 threes
[C, err, P, logp] = classify(Y, Y, ll, 'linear');
% The outputs of the function classify are:
% C: a vector whose values define the groups into which the rows of the matrix are classified
% err: an estimate of the misclassification error
% P: gives the probability that observation n belongs to the ith class
% logp: gives the logarithm of the probability density of the nth observation, where the latter is the sum,
%       over all classes, of the conditional probability of the nth observation given the ith class,
%       times the probability of the ith class.
Computational Method
We can implement this computationally by the following:
Define two functions,
δ_{k}(x) = x^{T}Σ^{-1}μ_{k} − (1/2)μ_{k}^{T}Σ^{-1}μ_{k} + log(π_{k})
and
δ_{l}(x) = x^{T}Σ^{-1}μ_{l} − (1/2)μ_{l}^{T}Σ^{-1}μ_{l} + log(π_{l}).
To classify a point x, first compute δ_{k}(x) and δ_{l}(x).
Classify it to class k if δ_{k}(x) > δ_{l}(x), and vice versa.
(Note: since the term −(1/2)x^{T}Σ^{-1}x is a constant common to both classes, we can simply ignore it in the actual computation, since it will cancel out when we compare the deltas.)
Special case: Σ = I, the identity matrix. Then,
δ_{k}(x) = x^{T}μ_{k} − (1/2)μ_{k}^{T}μ_{k} + log(π_{k}),
which equals −(1/2)|x − μ_{k}|^{2} + log(π_{k}) up to a term common to all classes.
We see that in the case Σ = I, we can simply classify a point x to a class based on the distance between x and the mean of each class (adjusted with the log of the prior).
General case: since Σ is symmetric, it has the eigendecomposition
Σ = UΛU^{T},
where U holds the eigenvectors of Σ and Λ is the diagonal matrix of its eigenvalues. So,
Σ^{-1} = UΛ^{-1}U^{T} = (UΛ^{-1/2})(Λ^{-1/2}U^{T}),
where Λ^{-1/2} = diag(λ_{1}^{-1/2}, …, λ_{d}^{-1/2}).
Hence the approach taken should be to transform the point x from the beginning,
i.e. let
x* = Λ^{-1/2}U^{T}x and μ_{k}* = Λ^{-1/2}U^{T}μ_{k}.
Then compute δ_{k} and δ_{l} with x^{ * }, similar to the special case above.
If the prior distributions of the 2 classes are the same, then this method only requires us to find the distances from the point x to the mean of the 2 classes. We would classify x based on the shortest distance to the mean.
In the δ function calculations above, π_{k} = P(Y = k) and π_{l} = P(Y = l), which can be approximated using the proportions of class-k and class-l elements in the training set.
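A sketch of the δ computation in Python/NumPy (the means, shared covariance, and priors here are illustrative, not estimated from the course data):

```python
import numpy as np

# Sketch of the delta comparison (means, shared covariance, and priors are
# illustrative, not estimated from the course data).
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 2.0])}
priors = {0: 0.5, 1: 0.5}

def delta(x, k):
    m = mus[k]
    # delta_k(x) = x' Sigma^{-1} mu_k - (1/2) mu_k' Sigma^{-1} mu_k + log pi_k
    return x @ Sigma_inv @ m - 0.5 * m @ Sigma_inv @ m + np.log(priors[k])

def classify_lda(x):
    return max(mus, key=lambda k: delta(x, k))   # class with the larger delta

print(classify_lda(np.array([0.2, -0.1])), classify_lda(np.array([2.8, 2.1])))  # 0 1
```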
More on Quadratic Discriminant Analysis  July 28
The curse of dimensionality is a problem in many fields, including computer science and statistics. Loosely stated, it says that many problems in computer science and statistics become impractical as the number of dimensions increases.
For example, the number of points needed to accurately estimate parameters increases with the number of parameters. For linear discriminant analysis on d-dimensional data, we have to estimate d(d + 1)/2 elements of the (symmetric) covariance matrix and d elements for each class mean, for a total of d(d + 1)/2 + 2d parameters in a two-class problem. For quadratic discriminant analysis, we have to estimate a second covariance matrix, so we estimate d(d + 1) + 2d parameters in total. Clearly we have to estimate more parameters for QDA than LDA, so we need more points to get an accurate estimate for the discriminant.
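The counts can be checked with a couple of lines (Python; the two-class setting and d = 64, the digit data's dimension, are assumptions for illustration):

```python
# Sketch of the parameter counts for a two-class problem in d dimensions
# (LDA shares one covariance matrix; QDA fits one per class).
def lda_params(d):
    cov = d * (d + 1) // 2    # free elements of one symmetric covariance matrix
    means = 2 * d             # one d-dimensional mean per class
    return cov + means

def qda_params(d):
    return 2 * (d * (d + 1) // 2) + 2 * d   # a covariance matrix per class

print(lda_params(64), qda_params(64))   # 2208 4288 for 64-pixel digit images
```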
Estimation:
We can find very flexible boundaries with LDA by expanding the input domain. For example, suppose x has two features and can be represented as x = (x_{1}, x_{2}). Then,
we can expand x to (x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}) and get a quadratic boundary.
We can expand x to (x_{1}, x_{2}, x_{1}^{2}, x_{2}^{2}, x_{1}^{3}, x_{2}^{3}) and get a cubic boundary.
etc.
In Matlab:
% To double the input:
load 2_3;
[U, Y] = princomp(X');
ll = [zeros(200,1); ones(200,1)];   % class labels for the 400 points
Y1 = [Y Y.^2];                      % append the squared features
[C, err, p, logp] = classify(Y1, Y1, ll, 'linear');
% Now it classifies using not only y but also y^2.