In statistics and machine learning, Gibbs Sampling is a powerful Markov Chain Monte Carlo (MCMC) technique frequently used to sample from complex, high-dimensional probability distributions. This article examines the foundational ideas, mathematical formulation, and algorithm of Gibbs Sampling.
Gibbs sampling is a type of Markov Chain Monte Carlo (MCMC) method for sampling from multivariate (multi-variable) probability distributions. Sampling directly from a high-dimensional joint distribution is typically difficult, so we work with conditional distributions instead. Rather than sampling from the joint probability p(x_1, x_2, x_3, ..., x_n), we sample each variable from its conditional p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). The idea behind Gibbs Sampling is that we can update the value of one variable x_i while keeping all other variables fixed at their current values. Repeating this update for each variable in turn creates a Markov Chain, and after enough iterations the chain converges to the joint distribution. In other words, instead of sampling from the joint distribution directly, we sample from the conditional distributions of the multivariate distribution, which in the limit gives us samples from the joint distribution.
Let’s discuss the Markov Chain Monte Carlo (MCMC) method in a little more detail to build intuition for Gibbs Sampling.
In a Markov Chain, the future state depends only on the current state and is independent of all earlier states. To get a sample from the joint distribution, a Markov chain is constructed whose stationary distribution corresponds to the target distribution. The samples drawn before the stationary state is reached belong to the burn-in phase.
Monte Carlo methods use random sampling to approximate a distribution when direct sampling from it is not feasible.
In MCMC, a chain of random samples X_i -> X_{i+1} is formed using a transition probability T(X_{i+1} | X_i), which is the probability that the chain moves from state X_i to state X_{i+1}; moves toward regions of higher density under p(x) are favored. We iterate this process of moving from one sample to the next until the samples become stationary, meaning their distribution no longer changes; at that point the stationary distribution can be matched against the target distribution.
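As a small illustration of how repeated transitions settle into a stationary state, the sketch below power-iterates a hypothetical 2-state transition matrix (the matrix values are made up purely for illustration):

```python
import numpy as np

# A hypothetical 2-state Markov chain: T[i, j] is the probability of
# moving from state i to state j. Each row must sum to 1.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Start from an arbitrary initial distribution and apply T repeatedly.
dist = np.array([1.0, 0.0])
for _ in range(1000):
    dist = dist @ T

# After many steps the distribution stops changing: the stationary state,
# which satisfies pi = pi @ T.
print(dist)  # ~ [0.8333, 0.1667], i.e. [5/6, 1/6]
```

Note that the final distribution no longer depends on the starting point; any initial distribution converges to the same stationary state.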
Let’s say that we start the sampling from X_0; the sampling process then proceeds like a chain – X_0 -> X_1 -> X_2 -> ... -> X_m -> X_{m+1} -> X_{m+2} -> ... and so on.
Here, let’s say that the samples start matching the target distribution from X_m onward; the steps taken from X_0 to X_{m-1} are then considered the burn-in phase. The burn-in phase is required before the samples reflect the target distribution.
The stationary state is reached when the distribution satisfies the detailed balance equation:

p(x) T(y|x) = p(y) T(x|y)

where T(y|x) is the transition probability from x -> y and T(x|y) is the transition probability from y -> x.
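Detailed balance can be checked numerically. The sketch below uses a hypothetical 2-state chain whose stationary distribution is known, and verifies that the probability flow from state 0 to state 1 equals the reverse flow:

```python
import numpy as np

# Detailed balance: p(x) T(y|x) = p(y) T(x|y).
# Hypothetical 2-state chain; T[i, j] = probability of moving i -> j.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
p = np.array([5/6, 1/6])  # stationary distribution of T

# Probability flow 0 -> 1 versus flow 1 -> 0.
flow_01 = p[0] * T[0, 1]
flow_10 = p[1] * T[1, 0]
print(flow_01, flow_10)  # both equal 1/12 ≈ 0.0833
```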
For the sake of simplicity, let us consider a distribution over three variables X, Y, Z, and suppose we want to sample from the joint distribution p(X, Y, Z).
At the start of the Gibbs Sampling process, the variables are initialized either randomly or with predetermined starting values. The algorithm then cycles through each variable iteratively, updating its value according to its conditional distribution given the current values of the other variables: a sample is drawn from the conditional distribution of each variable x_i, and the current state is updated with this sampled value. This procedure is repeated; after a sufficient number of iterations, the Markov chain explores the joint distribution and eventually converges to the target distribution. Gibbs Sampling can navigate high-dimensional probability spaces effectively because of its sequential variable updates and its reliance on conditional distributions.
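The update cycle described above can be sketched as a generic loop. Here `sample_conditional` is an assumed placeholder for whatever full-conditional sampler a particular model provides; it is not part of any library:

```python
import numpy as np

def gibbs_sample(sample_conditional, x0, n_iters, burn_in):
    """Generic Gibbs sampler sketch.

    sample_conditional(i, x) is assumed to draw a new value for
    variable i from p(x_i | all other components of x); it stands in
    for whatever conditionals your model defines.
    """
    x = np.array(x0, dtype=float)
    samples = []
    for t in range(n_iters):
        for i in range(len(x)):      # update one variable at a time
            x[i] = sample_conditional(i, x)
        if t >= burn_in:             # discard the burn-in phase
            samples.append(x.copy())
    return np.array(samples)
```

For example, with independent standard-normal conditionals, `gibbs_sample(lambda i, x: np.random.normal(), [0.0, 0.0], 1000, 100)` returns an array of 900 two-dimensional samples.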
Gibbs sampling involves various mathematical functions and expressions. Here are some common ones:
Therefore, for a bivariate Gaussian with means μ_x, μ_y, standard deviations σ_x, σ_y and correlation ρ, the conditional probability of x given y can be represented as a Gaussian distribution with mean (μ_x|y) and variance (σ²_x|y):

μ_x|y = μ_x + ρ (σ_x / σ_y) (y - μ_y)

σ²_x|y = σ_x² (1 - ρ²)

P(x | y) = N(μ_x|y, σ²_x|y)
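Using these Gaussian conditionals, a two-variable Gibbs sampler can be sketched as follows. The means, unit variances and correlation `rho` are made-up illustration values, with the conditional simplifying to N(μ_x + ρ(y - μ_y), 1 - ρ²):

```python
import numpy as np

# Gibbs sampling from a hypothetical bivariate Gaussian with unit
# variances and correlation rho, alternating between the exact
# conditionals p(x | y) and p(y | x).
rng = np.random.default_rng(0)
mu_x, mu_y, rho = 0.0, 0.0, 0.8
n_iters, burn_in = 20000, 1000

x, y = 0.0, 0.0
samples = []
for t in range(n_iters):
    # Draw x from p(x | y), then y from p(y | x), each a 1-D Gaussian.
    x = rng.normal(mu_x + rho * (y - mu_y), np.sqrt(1 - rho**2))
    y = rng.normal(mu_y + rho * (x - mu_x), np.sqrt(1 - rho**2))
    if t >= burn_in:                 # discard the burn-in phase
        samples.append((x, y))

samples = np.array(samples)
corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print(corr)  # close to rho = 0.8
```

The empirical correlation of the retained samples recovers the target correlation, even though each individual draw only ever used a one-dimensional conditional.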
We can repeat what we just did for an ‘n’-variable distribution, sampling each variable in turn while keeping the other ‘n-1’ variables fixed.
Suppose we have two variables (X, Y), each taking the values 0 and 1, and we want to sample from their joint distribution p(X, Y), which is defined as follows:
Let’s follow the above instructions to sample from the joint probability:
Initialize X and Y; let’s say X = 0 and Y = 0.
Sample X from p(X | Y = 0). Since p(X=1|Y=0) > p(X=0|Y=0), suppose the draw gives X = 1.
Sample Y from p(Y | X = 1). Since p(Y=1|X=1) > p(Y=0|X=1), suppose the draw gives Y = 1.
We now have a new sample (X = 1, Y = 1) obtained from the initial sample (X = 0, Y = 0).
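The same two-variable walk-through can be run in code. Since the joint table itself is not reproduced here, the table below is a hypothetical one, chosen only so that p(X=1|Y=0) > p(X=0|Y=0) and p(Y=1|X=1) > p(Y=0|X=1), consistent with the steps above:

```python
import numpy as np

# Hypothetical joint table p[X, Y], chosen only for illustration.
p = np.array([[0.1, 0.2],   # p(X=0, Y=0), p(X=0, Y=1)
              [0.3, 0.4]])  # p(X=1, Y=0), p(X=1, Y=1)

rng = np.random.default_rng(0)
X, Y = 0, 0
counts = np.zeros((2, 2))
for t in range(50000):
    # Sample X from p(X | Y): the Y-th column, renormalized.
    X = rng.choice(2, p=p[:, Y] / p[:, Y].sum())
    # Sample Y from p(Y | X): the X-th row, renormalized.
    Y = rng.choice(2, p=p[X, :] / p[X, :].sum())
    if t >= 1000:                # discard the burn-in phase
        counts[X, Y] += 1

est = counts / counts.sum()
print(est)  # approaches the joint table p
```

Even though every draw uses only a one-dimensional conditional, the long-run frequencies of the visited (X, Y) pairs approach the joint distribution.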
Let’s use a real-world dataset to perform Gibbs sampling. For this example we will be using the iris flower dataset, which has 4 features, namely Sepal Length in cm, Sepal Width in cm, Petal Length in cm and Petal Width in cm, and a target column named “Species” which specifies which species of flower each record belongs to. We will use three of these features, Sepal Length in cm, Sepal Width in cm and Petal Length in cm, and sample from them using Gibbs sampling. Now that we know the dataset, let’s sample from it to get more concrete insight into Gibbs Sampling. The link to get the iris dataset is present here – link.
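A minimal sketch of that experiment is shown below. It assumes we model the three features as jointly Gaussian (an assumption made only to keep the conditionals tractable) and loads the dataset through scikit-learn rather than the linked file:

```python
import numpy as np
from sklearn.datasets import load_iris

# Fit a multivariate Gaussian to three iris features, then Gibbs-sample
# from it using the exact Gaussian full conditionals.
data = load_iris().data[:, :3]   # sepal length, sepal width, petal length
mu = data.mean(axis=0)
Sigma = np.cov(data, rowvar=False)

rng = np.random.default_rng(0)
x = mu.copy()
samples = []
for t in range(20000):
    for i in range(3):
        # Condition variable i on the other two ("rest") variables.
        rest = [j for j in range(3) if j != i]
        S_rr_inv = np.linalg.inv(Sigma[np.ix_(rest, rest)])
        cond_mean = mu[i] + Sigma[i, rest] @ S_rr_inv @ (x[rest] - mu[rest])
        cond_var = Sigma[i, i] - Sigma[i, rest] @ S_rr_inv @ Sigma[rest, i]
        x[i] = rng.normal(cond_mean, np.sqrt(cond_var))
    if t >= 1000:                # discard the burn-in phase
        samples.append(x.copy())

samples = np.array(samples)
print(samples.mean(axis=0))  # close to the empirical feature means mu
```

The sample means (and covariances) of the Gibbs draws recover the statistics of the fitted Gaussian, showing that cycling through one-dimensional conditionals is enough to explore the three-dimensional joint.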