In statistics and machine learning, Gibbs Sampling is a powerful Markov Chain Monte Carlo (MCMC) technique frequently used to sample from complex, high-dimensional probability distributions. This article examines the foundational ideas, mathematical formulation, and algorithm of Gibbs Sampling.
Gibbs sampling is a type of Markov Chain Monte Carlo (MCMC) method for sampling from multivariate (multi-variable) probability distributions. Sampling directly from a high-dimensional joint distribution is typically difficult, so we work with conditional distributions instead. Rather than sampling from the joint probability p(x_1, x_2, x_3, ..., x_n), we sample each variable from its conditional p(x_i | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). The idea behind Gibbs Sampling is that we can update the value of one variable x_i while keeping all other variables fixed at their current values. Repeating this update for each variable in turn creates a Markov Chain, and after enough iterations the chain converges to the joint distribution. In other words, instead of sampling from the joint distribution directly, we sample from the conditional distributions of the multivariate distribution, which in the limit gives us samples from the joint distribution.
Let’s discuss the Markov Chain Monte Carlo (MCMC) method in a little more detail to build intuition for Gibbs Sampling.
In a Markov Chain, the future state depends only on the current state and is independent of all earlier states. To get a sample from the joint distribution, a Markov chain is constructed whose stationary distribution corresponds to the target distribution. The samples drawn before the stationary state is reached belong to the burn-in phase.
Monte Carlo methods use random sampling to approximate a distribution when direct sampling from it is not feasible.
In MCMC, a chain of random samples X_i -> X_{i+1} is formed using a transition probability T(X_{i+1} | X_i), which is the probability that the chain moves from state X_i to state X_{i+1}; moves toward regions of higher density under p(x) are favored. We iterate this process of moving from one sample to the next until the samples become stationary, meaning their distribution no longer changes; at that point the stationary distribution can be matched against the target distribution.
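As a small illustration of how repeated transitions settle into a stationary state, the sketch below power-iterates a hypothetical 2-state transition matrix (the matrix values are made up purely for illustration):

```python
import numpy as np

# A hypothetical 2-state Markov chain: T[i, j] is the probability of
# moving from state i to state j. Each row must sum to 1.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Start from an arbitrary initial distribution and apply T repeatedly.
dist = np.array([1.0, 0.0])
for _ in range(1000):
    dist = dist @ T

# After many steps the distribution stops changing: the stationary state,
# which satisfies pi = pi @ T.
print(dist)  # ~ [0.8333, 0.1667], i.e. [5/6, 1/6]
```

Note that the final distribution no longer depends on the starting point; any initial distribution converges to the same stationary state.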
Let’s say that we start the sampling from X_0; the sampling process then proceeds like a chain – X_0 -> X_1 -> X_2 -> ... -> X_m -> X_{m+1} -> X_{m+2} -> ... and so on.
Here, let’s say that the samples start matching the target distribution from X_m onward; the steps taken from X_0 to X_{m-1} are then considered the burn-in phase. The burn-in phase is required before the samples reflect the target distribution.
The stationary state is reached when the distribution satisfies the detailed balance equation:

p(x) T(y|x) = p(y) T(x|y)

where T(y|x) is the transition probability from x -> y and T(x|y) is the transition probability from y -> x.
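Detailed balance can be checked numerically. The sketch below uses a hypothetical 2-state chain whose stationary distribution is known, and verifies that the probability flow from state 0 to state 1 equals the reverse flow:

```python
import numpy as np

# Detailed balance: p(x) T(y|x) = p(y) T(x|y).
# Hypothetical 2-state chain; T[i, j] = probability of moving i -> j.
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])
p = np.array([5/6, 1/6])  # stationary distribution of T

# Probability flow 0 -> 1 versus flow 1 -> 0.
flow_01 = p[0] * T[0, 1]
flow_10 = p[1] * T[1, 0]
print(flow_01, flow_10)  # both equal 1/12 ≈ 0.0833
```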
For the sake of simplicity, let us consider a distribution over three variables X, Y, Z, and suppose we want to sample from the joint distribution p(X, Y, Z).
At the start of the Gibbs Sampling process, the variables are initialized either randomly or with predetermined starting values. The algorithm then cycles through each variable iteratively, updating its value according to its conditional distribution given the current values of the other variables: a sample is drawn from the conditional distribution of each variable x_i, and the current state is updated with this sampled value. This procedure is repeated; after a sufficient number of iterations, the Markov chain explores the joint distribution and eventually converges to the target distribution. Gibbs Sampling can navigate high-dimensional probability spaces effectively because of its sequential variable updates and its reliance on conditional distributions.
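The update cycle described above can be sketched as a generic loop. Here `sample_conditional` is an assumed placeholder for whatever full-conditional sampler a particular model provides; it is not part of any library:

```python
import numpy as np

def gibbs_sample(sample_conditional, x0, n_iters, burn_in):
    """Generic Gibbs sampler sketch.

    sample_conditional(i, x) is assumed to draw a new value for
    variable i from p(x_i | all other components of x); it stands in
    for whatever conditionals your model defines.
    """
    x = np.array(x0, dtype=float)
    samples = []
    for t in range(n_iters):
        for i in range(len(x)):      # update one variable at a time
            x[i] = sample_conditional(i, x)
        if t >= burn_in:             # discard the burn-in phase
            samples.append(x.copy())
    return np.array(samples)
```

For example, with independent standard-normal conditionals, `gibbs_sample(lambda i, x: np.random.normal(), [0.0, 0.0], 1000, 100)` returns an array of 900 two-dimensional samples.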
Gibbs sampling involves various mathematical functions and expressions. Here are some common ones:
Therefore, for a bivariate Gaussian with means μ_x, μ_y, standard deviations σ_x, σ_y and correlation ρ, the conditional probability of x given y can be represented as a Gaussian distribution with mean (μ_x|y) and variance (σ²_x|y):

μ_x|y = μ_x + ρ (σ_x / σ_y) (y - μ_y)

σ²_x|y = σ_x² (1 - ρ²)

P(x | y) = N(μ_x|y, σ²_x|y)
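Using these Gaussian conditionals, a two-variable Gibbs sampler can be sketched as follows. The means, unit variances and correlation `rho` are made-up illustration values, with the conditional simplifying to N(μ_x + ρ(y - μ_y), 1 - ρ²):

```python
import numpy as np

# Gibbs sampling from a hypothetical bivariate Gaussian with unit
# variances and correlation rho, alternating between the exact
# conditionals p(x | y) and p(y | x).
rng = np.random.default_rng(0)
mu_x, mu_y, rho = 0.0, 0.0, 0.8
n_iters, burn_in = 20000, 1000

x, y = 0.0, 0.0
samples = []
for t in range(n_iters):
    # Draw x from p(x | y), then y from p(y | x), each a 1-D Gaussian.
    x = rng.normal(mu_x + rho * (y - mu_y), np.sqrt(1 - rho**2))
    y = rng.normal(mu_y + rho * (x - mu_x), np.sqrt(1 - rho**2))
    if t >= burn_in:                 # discard the burn-in phase
        samples.append((x, y))

samples = np.array(samples)
corr = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
print(corr)  # close to rho = 0.8
```

The empirical correlation of the retained samples recovers the target correlation, even though each individual draw only ever used a one-dimensional conditional.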
We can repeat what we just did for an ‘n’-variable distribution, sampling each variable in turn while keeping the other ‘n-1’ variables fixed.
Suppose we have two variables (X, Y), each taking the values 0 and 1, and we want to sample from their joint distribution p(X, Y), which is defined as follows:
Let’s follow the above instructions to sample from the joint probability:
Initialize X and Y; let’s say X = 0 and Y = 0.
Sample X from p(X | Y = 0). Since p(X=1|Y=0) > p(X=0|Y=0), suppose the draw gives X = 1.
Sample Y from p(Y | X = 1). Since p(Y=1|X=1) > p(Y=0|X=1), suppose the draw gives Y = 1.
We now have a new sample (X = 1, Y = 1) obtained from the initial sample (X = 0, Y = 0).
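The same two-variable walk-through can be run in code. Since the joint table itself is not reproduced here, the table below is a hypothetical one, chosen only so that p(X=1|Y=0) > p(X=0|Y=0) and p(Y=1|X=1) > p(Y=0|X=1), consistent with the steps above:

```python
import numpy as np

# Hypothetical joint table p[X, Y], chosen only for illustration.
p = np.array([[0.1, 0.2],   # p(X=0, Y=0), p(X=0, Y=1)
              [0.3, 0.4]])  # p(X=1, Y=0), p(X=1, Y=1)

rng = np.random.default_rng(0)
X, Y = 0, 0
counts = np.zeros((2, 2))
for t in range(50000):
    # Sample X from p(X | Y): the Y-th column, renormalized.
    X = rng.choice(2, p=p[:, Y] / p[:, Y].sum())
    # Sample Y from p(Y | X): the X-th row, renormalized.
    Y = rng.choice(2, p=p[X, :] / p[X, :].sum())
    if t >= 1000:                # discard the burn-in phase
        counts[X, Y] += 1

est = counts / counts.sum()
print(est)  # approaches the joint table p
```

Even though every draw uses only a one-dimensional conditional, the long-run frequencies of the visited (X, Y) pairs approach the joint distribution.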
Let’s use a real-world dataset to perform Gibbs sampling. For this example we will be using the iris flower dataset, which has 4 features, namely Sepal Length in cm, Sepal Width in cm, Petal Length in cm and Petal Width in cm, and a target column named “Species” which specifies which species of flower each record belongs to. We will use three of these features, Sepal Length in cm, Sepal Width in cm and Petal Length in cm, and sample from them using Gibbs sampling. Now that we know the dataset, let’s sample from it to get more concrete insight into Gibbs Sampling. The link to get the iris dataset is present here – link.
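A minimal sketch of that experiment is shown below. It assumes we model the three features as jointly Gaussian (an assumption made only to keep the conditionals tractable) and loads the dataset through scikit-learn rather than the linked file:

```python
import numpy as np
from sklearn.datasets import load_iris

# Fit a multivariate Gaussian to three iris features, then Gibbs-sample
# from it using the exact Gaussian full conditionals.
data = load_iris().data[:, :3]   # sepal length, sepal width, petal length
mu = data.mean(axis=0)
Sigma = np.cov(data, rowvar=False)

rng = np.random.default_rng(0)
x = mu.copy()
samples = []
for t in range(20000):
    for i in range(3):
        # Condition variable i on the other two ("rest") variables.
        rest = [j for j in range(3) if j != i]
        S_rr_inv = np.linalg.inv(Sigma[np.ix_(rest, rest)])
        cond_mean = mu[i] + Sigma[i, rest] @ S_rr_inv @ (x[rest] - mu[rest])
        cond_var = Sigma[i, i] - Sigma[i, rest] @ S_rr_inv @ Sigma[rest, i]
        x[i] = rng.normal(cond_mean, np.sqrt(cond_var))
    if t >= 1000:                # discard the burn-in phase
        samples.append(x.copy())

samples = np.array(samples)
print(samples.mean(axis=0))  # close to the empirical feature means mu
```

The sample means (and covariances) of the Gibbs draws recover the statistics of the fitted Gaussian, showing that cycling through one-dimensional conditionals is enough to explore the three-dimensional joint.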