Independant Principal Component Analysis
In some case studies, we have identified some limitations when using PCA:
PCA assumes that gene expression follows a multivariate normal distribution and recent studies have demonstrated that microarray gene expression measurements follow instead a super-Gaussian distribution.
PCA decomposes the data based on the maximization of its variance. In some cases, the biological question may not be related to the highest variance in the data.
Instead, we propose to apply Independent Principal Component Analysis (IPCA) which combines the advantages of both PCA and Independent Component Analysis (ICA). It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. A sparse version is also proposed (sIPCA). This approach was proposed in collaboration with Eric F. Yao (QFAB and University of Shanghai).
The algorithm of IPCA is as follows:
The original data matrix is centered (by default).
PCA is used to reduce dimension and generate the loading vectors.
ICA (FastICA) is implemented on the loading vectors to generate independent loading vectors.
The centered data matrix is projected on the independent loading vectors to obtain the independent principal components.
IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA.
Choosing the optimal parameters
The number of variables to select is still an open issue. In Yao et al (2012) we proposed to use the Davies Bouldin measure which is an index of crisp cluster validity. This index compares the within-cluster scatter with the between-cluster separation.
IPCA is of class sPCA and PCA, and most of the PCA graphical methods can be applied. The default algorithm to estimate the unmixing matrix is set to mode = ‘deflation’. By default, the data are centered, but not necessarily scaled.
library(mixOmics) data("liver.toxicity") ipca.res <- ipca(liver.toxicity$gene, ncomp = 3, mode="deflation") #ipca.res
See Case Study: IPCA Liver Toxicity for plotting options.
The kurtosis measure is used to order the loading vectors to order the Independent Principal Components. We have shown that the kurtosis value is a good post hoc indicator of the number of components to choose, as a sudden drop in the values corresponds to irrelevant dimensions.
##  9.7068221 6.9869933 0.6729702
See Case Study: IPCA Liver toxicity for more examples and plotting options.