In this blog, we will walk through a probabilistic formulation of the well-known technique of principal component analysis (PCA). The probabilistic PCA (PPCA) was introduced by Bishop [1]. PPCA does not generate better results than PCA, but it permits a broad range of future extensions. Besides, PPCA falls in the category of Bayesian models, which are enjoying a renaissance. Other models, such as Gaussian mixture model, K-means clustering, factor analysis, and variational autoencoder, are all Bayesian models or can be viewed from a Bayesian perspective. So I hope this blog helps to view PCA from a Bayesian angle and grasp the basic framework of constructing a Bayesian model. In the following parts, I will briefly introduce PCA, then focus on how PPCA was formulated, how to train PPCA and inference in PPCA.

PCA is a dimensionality reduction technique that transforms the original data into a lower dimensionality space via an orthogonal transformation. PCA can be derived in two ways. One is to find a manifold of lower dimensionality that the data have a maximum variance when the data is projected in the lower dimensionality space. The idea of the other way is to minimize recovery error if we try to recover the original data from the compressed data.

PPCA is formulated as a Bayesian generative model. In such a framework, we define an explicit latent variable z corresponding to the principal-component subspace. Next, we relate the latent variable to the observed variable, x, via Bayesian theorem, which defines a rule to update the posterior distribution using conditional distribution and prior. We choose Gaussian distribution as the prior and conditional distribution due to the desired properties of Gaussian distribution. The PPCA is fully defined by the following six equations in which a lowercase/uppercase letter in bold represents a vector/matrix. First, note that equation 13 are assumptions and equation 46 are derived using the properties of Gaussian distribution. Here I describe the essence of each equation in one sentence.

1. We assume the latent variable z \in R^{k} is given by a zero-mean unit-covariance Gaussian.

2. We generate a higher-dimensional x via a mapping matrix W \in R^{d\times k}, plus a vector \mathbf{\mu}\in R^{d} and some noise.

3. Equation 3 defines a Gaussian variable, which is the noise part in equation 2.

4. The distribution of x, conditional on the latent variable z, follows a Gaussian distribution, whose mean and covriance matrix are \mathbf{Wz + \mu} and \sigma^2\mathbf{I}, respectively. \mathbf{W}, \mu, and \sigma are parameters that will be learned during the training process.

5. The integral of the joint distribution of \mathbf{x} and \mathbf{z} over \mathbf{z} yields the distribution of \mathbf{x}, which follows a Gaussian distribution, whose mean (\mathbf{\mu_\mathbf{x}}) and covariance matrix (\mathbf{C_x}) are equal to, \mathbf{\mu} and \mathbf{WW^T + \sigma^2I}, respectively. If you don’t care about the math details, the takeaway from this equation is that \mathbf{x} follows a Gaussian.

6. Equation 2 defines a way to map \mathbf{z} to \mathbf{x}. To “map” \mathbf{x} to \mathbf{z}, in a probabilistic language, we need to the distribution of \mathbf{z} conditional on \mathbf{x}, \mathbf{p(z|x)}. Due to properties of Gaussian distribution, \mathbf{p(z|x)} also follows a Gaussian distribution, whose mean is equal to (\mathbf{W^TW + \sigma^2I})^{-1}\mathbf{W^T}(\mathbf{x-\mu}). The \mathbf{C_z} is not used in inference, so I omit it here. \mathbf{W}, \mu, \sigma are the same as before.

(1)   \begin{equation*}  p(\mathbf{z}) = N(\mathbf{0}, \mathbf{I}) \end{equation*}

(2)   \begin{equation*}  \mathbf{x} = \mathbf{W}\mathbf{z} + \mathbf{\mu} + \mathbf{\epsilon} \end{equation*}

(3)   \begin{equation*}  \mathbf{\epsilon} = N(\mathbf{0}, \sigma^2\mathbf{I}) \end{equation*}

(4)   \begin{equation*}  p(\mathbf{x}|\mathbf{z}) = N(\mathbf{Wz} + \mathbf{\mu}, \sigma^2\mathbf{I}) \end{equation*}

(5)   \begin{equation*}  p(\mathbf{x}) = \int_\mathbf{z} p(\mathbf{x}|\mathbf{z})p(\mathbf{z})dz = N(\mathbf{\mu}_{\mathbf{x}}, \mathbf{C}_\mathbf{x}) \end{equation*}

(6)   \begin{equation*}  p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})p(\mathbf{z})}{p(\mathbf{x})} = N(\mathbf{\mu}_{\mathbf{z}}, \mathbf{C}_\mathbf{z}) \end{equation*}

So far, we have defined PPCA. Now it is time to train PPCA. Since the marginal likelihood of \mathbf{x} in equation 4 has an analytic form. We can maximize the p(\mathbf{x}) directly. An alternative way is to use the expectation-maximization (EM) algorithm which sequentially optimizes p(\mathbf{x}) with respect to the distribution of z and with respect to the parameters in each iteration. (EM is commonly used when training a latent variable model.)

Finally, with the trained parameters, \mathbf{W}, \mu, \sigma, we can use equation 6 to map \mathbf{x} to a lower-dimension space. For further details about the derivation and implementation in Python, see the references.

1. Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.