This case study explores how to use sparse Independent Principal Component Analysis (sIPCA) to analyse gene expression data from rats exposed to acetaminophen. sIPCA is helpful when data are noisy or do not follow a normal distribution, as is common in biological datasets. Follow along to see when sIPCA is a better choice than PCA, and how to build and interpret an optimised model.
Data used on this page: liver.toxicity
Key functions used on this page: sipca(), plotIndiv(), plotVar()
Additional notes:
Kurtosis
The kurtosis measure is used to order the Independent Principal Components (IPCs). Kurtosis here means excess kurtosis, so a value of zero is what a Gaussian distribution produces, and a larger magnitude indicates a greater deviation from a Gaussian distribution. Greater deviations are desirable because IPCA looks for non-Gaussian structure.
Yao et al. (2012) showed that the kurtosis value is a good post-hoc indicator of the number of components to choose, as a sudden drop in the values corresponds to irrelevant components.
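As a rough illustration of the kurtosis measure described above, the following Python sketch computes the population excess kurtosis (fourth central moment divided by the squared variance, minus 3) and ranks some components by its magnitude. It is not the mixOmics implementation. The function names, the toy vectors, and the choice to rank by absolute value are all assumptions made for this example.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3, which is 0 for a Gaussian."""
    mu = fmean(values)
    m2 = fmean([(v - mu) ** 2 for v in values])  # second central moment
    m4 = fmean([(v - mu) ** 4 for v in values])  # fourth central moment
    return m4 / m2 ** 2 - 3.0


def rank_by_kurtosis(components):
    """Return (name, kurtosis) pairs, largest |kurtosis| first.

    Ranking by absolute value is an assumption of this sketch.
    """
    scored = [(name, excess_kurtosis(vec)) for name, vec in components.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy vectors standing in for three components.
components = {
    "bell": [-2, -1, -1, 0, 0, 0, 0, 1, 1, 2],  # roughly bell-shaped
    "flat": [-1, 1] * 10,                        # two-point, light-tailed
    "spiky": [0] * 18 + [5, -5],                 # mostly zero, heavy-tailed
}

ranked = rank_by_kurtosis(components)
```

In this toy ranking the mostly-zero, heavy-tailed vector comes first and the bell-shaped one last. A sharp fall in magnitude between consecutive entries is the kind of drop that would suggest the later components are irrelevant.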
Davies-Bouldin Index
This value is the ratio of intracluster (within-cluster) scatter to intercluster (between-cluster) separation, taken for each cluster against its most similar neighbour and then averaged over all clusters. Low values indicate good clustering: points within a cluster are tightly grouped and different clusters are well separated from one another.
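To make the ratio concrete, here is a minimal Python sketch of the standard Davies-Bouldin definition. Scatter is the mean distance from each point to its cluster centroid. For each cluster the worst (largest) value of (scatter_i + scatter_j) / distance(centroid_i, centroid_j) is kept, and these values are averaged. The function names and the toy 2-D points are assumptions for illustration. In practice the clusters would be sample groups projected onto the components, and this is not the code used by mixOmics.

```python
from math import dist
from statistics import fmean


def centroid(points):
    """Coordinate-wise mean of a list of equal-length point tuples."""
    return tuple(fmean(coord) for coord in zip(*points))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    centroids = [centroid(pts) for pts in clusters]
    # Within-cluster scatter: mean distance of points to their centroid.
    scatter = [fmean([dist(p, c) for p in pts])
               for pts, c in zip(clusters, centroids)]
    k = len(clusters)
    worst_ratios = []
    for i in range(k):
        # Compare cluster i with every other cluster and keep the worst ratio.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        ))
    return fmean(worst_ratios)


# Hypothetical toy data: the same two clusters, far apart and close together.
well_separated = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
poorly_separated = [[(0, 0), (0, 2)], [(4, 0), (4, 2)]]

db_good = davies_bouldin(well_separated)
db_poor = davies_bouldin(poorly_separated)
```

Both configurations have the same within-cluster scatter, so the index is lower only where the centroids are further apart, which matches the reading that low values indicate tight, well-separated clusters.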
References:
1. Bushel, P.R., Wolfinger, R.D. and Gibson, G., 2007. Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology, 1(1), p.1.
2. Yao, F., Coquery, J. and Lê Cao, K.-A., 2012. Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics, 13, p.24. doi:10.1186/1471-2105-13-24.