Case Study of sIPCA with Liver Toxicity dataset
Independent Principal Component Analysis (IPCA) is a useful alternative when standard PCA is limited, which usually happens when the dataset being analysed contains large quantities of noise. This case study draws many parallels to the process in the PCA Multidrug Case Study. Choosing IPCA over PCA requires an understanding of the variation structure of the data under inspection.
For background information on the (s)IPCA method, refer to the IPCA Methods Page.
R script
The R script used for all the analysis in this case study is available here.
To begin
Load the latest version of mixOmics:
library(mixOmics)
The data
The liver toxicity dataset was generated in a study in which rats were subjected to varying levels of acetaminophen (Bushel et al., 2007).
The mixOmics liver toxicity dataset is accessed via liver.toxicity and contains the following:
- liver.toxicity$gene (continuous matrix): 64 rows and 3116 columns. The expression measure of 3116 genes for the 64 subjects (rats).
- liver.toxicity$clinic (continuous matrix): 64 rows and 10 columns, containing 10 clinical variables for the same 64 subjects.
- liver.toxicity$treatment (continuous/categorical matrix): 64 rows and 4 columns, containing information on the treatment of the 64 subjects, such as doses of acetaminophen and times of necropsy.
To confirm that the correct data was extracted, the dimensions are checked:
data(liver.toxicity) # call liver toxicity dataset
X <- liver.toxicity$gene # extract gene expression data
dim(X) # confirm the dimension of data
## [1] 64 3116
When to choose IPCA over PCA
One of the primary reasons to use IPCA rather than PCA is that the data do not follow a Gaussian distribution. A Shapiro-Wilk test (whose null hypothesis is that the data are normally distributed) was run on each feature. Figure 1 shows that nearly half of the 3116 features are unlikely to be Gaussian. This violates the assumption of normality that PCA relies on, making IPCA a more appropriate choice.
p.values <- numeric(ncol(X)) # preallocate a vector to hold the p-values
# for each column of the matrix, run a Shapiro-Wilk test
# and extract the p-value
for (col in seq_len(ncol(X))) {
  p.values[col] <- shapiro.test(X[, col])$p.value
}
h <- hist(p.values, breaks = 20, plot = FALSE) # create the histogram object
h$density <- h$counts / sum(h$counts) # rescale the counts to proportions
plot(h, freq = FALSE, ylim = c(0, 0.5)) # plot the histogram of proportions
FIGURE 1: Histogram depicting the distribution of p-values generated by running a Shapiro-Wilk test on each feature in the liver toxicity gene dataset. The tall bar below 0.05 indicates that many features are significantly unlikely to be normally distributed.
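The proportion of non-Gaussian features can also be computed directly rather than read off the histogram. The following is a minimal sketch and not part of the original analysis: prop_non_normal() is a hypothetical helper name, and the simulated matrix sim merely stands in for the gene expression matrix so that the block runs without any package. Applying the same helper to X from this case study would return the fraction of genes rejected at the chosen threshold.

```r
# Hypothetical helper: fraction of columns whose Shapiro-Wilk p-value is below alpha
prop_non_normal <- function(mat, alpha = 0.05) {
  p <- apply(mat, 2, function(v) shapiro.test(v)$p.value)
  mean(p < alpha)
}

set.seed(1)
n <- 64
# Simulated stand-in (assumption): 20 Gaussian columns and 20 skewed (exponential) columns
sim <- cbind(matrix(rnorm(n * 20), nrow = n),
             matrix(rexp(n * 20), nrow = n))
prop <- prop_non_normal(sim)
print(prop)
```

Note that testing thousands of features at a fixed threshold will flag some Gaussian features by chance, so the proportion is best read as a rough indication.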
Initial Analysis
As in the sPCA Case Study, a basic model is formed first to inspect the data prior to any sort of tuning. This will aid in the tuning process down the line.
toxic.sipca <- sipca(X, ncomp = 10) # run preliminary model
plot(toxic.sipca) # plot the explained variance per component
FIGURE 2: Explained variance of Independent Principal Components on the Liver Toxicity Gene data
Note that, unlike the equivalent plot in the PCA Multidrug Case Study, Figure 2 is not strictly decreasing. Components are not generated by maximising explained variance but by maximising the reduction in noise. Figure 2 suggests that the first component explains a high proportion of the variance observed in the data.
Tuning sIPCA
Scaling the data
The scale Parameter
By default, the data will not be scaled at all (scale = FALSE). Generally, scaling the data to unit variance is advised, especially when the variance is inhomogeneous across the variables.
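The effect of unit-variance scaling can be seen on a small simulated matrix. This sketch uses only base R's scale(); the matrix toy and its column spreads are made up for illustration. In the case study itself, scaling would instead be requested through the scale argument described above, for instance sipca(X, ncomp = 10, scale = TRUE).

```r
set.seed(42)
# Made-up data (assumption): three variables with very different spreads
toy <- cbind(a = rnorm(64, sd = 0.1),
             b = rnorm(64, sd = 1),
             c = rnorm(64, sd = 50))
sd_before <- apply(toy, 2, sd)

# Centre each column and divide it by its standard deviation
toy_scaled <- scale(toy, center = TRUE, scale = TRUE)
sd_after <- apply(toy_scaled, 2, sd)

print(round(sd_before, 2))
print(round(sd_after, 2))
```

After scaling, every variable has a standard deviation of one, so no single high-variance variable dominates the components.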
Selecting the number of components
The ncomp Parameter
The number of IPCs to select is an open issue. Rather than using the explained variance, the kurtosis value can be used. The ‘elbow’ method is still applicable to the kurtosis values, which are accessed via ipca.Object$kurtosis. The kurtosis value is described in more detail below. Figure 3 shows that three components is an appropriate choice.
barplot(toxic.sipca$kurtosis, ylim = c(0, 32),
names.arg = seq(1, 10, 1),
xlab = "Independent Principal Components",
ylab = "Kurtosis value")
FIGURE 3: Kurtosis values of Independent Principal Components on the Liver Toxicity Gene data
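The ‘elbow’ can also be located programmatically by looking for the largest drop between consecutive kurtosis values. This is only a rough heuristic sketch: the vector kurt below holds made-up values chosen to mimic an elbow after the third component, not the values shown in Figure 3, and largest_drop() is a hypothetical helper. With a fitted model, toxic.sipca$kurtosis could be passed in its place.

```r
# Hypothetical helper: number of components retained before the largest drop
largest_drop <- function(kurtosis) {
  drops <- -diff(kurtosis)  # positive where the value decreases
  which.max(drops)          # index of the last component before the biggest drop
}

# Made-up kurtosis values for ten components (illustration only)
kurt <- c(30, 24, 19, 6, 5, 4, 3, 3, 2, 2)
n_keep <- largest_drop(kurt)
print(n_keep)
```

Here the largest drop (from 19 to 6) falls between the third and fourth values, so three components are retained. A visual check of the barplot remains advisable, since a single largest drop can be misleading when the values decline gradually.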
Selecting the number of variables
The keepX Parameter
sipca() uses the same keepX parameter as spca(), resulting in the same difficulty in selecting the number of variables to be used for component construction. The Davies Bouldin index is a measure that has been used to optimise this decision (Yao et al., 2012). User experimentation with different values for keepX and their resulting Davies Bouldin index is recommended. This index is also explained in depth further below.
As there is no implementation of a tune.sipca() function within mixOmics which utilises this measure, it would need to be done manually. It is recommended to use the index.DB() function from the clusterSim package. For this case study, 50 features will be used for each component.
Final Model
Using these tuned parameters, the final model can now be run to yield an optimised sIPCA visualisation.
# based off figure 3, three components is best
# using the default keepX, c(50, 50, 50) in this case
final.sipca <- sipca(X, ncomp = 3)
Plots
Sample Plots
# generate sPCA on same dataset
final.spca <- spca(X, ncomp = 3, scale = TRUE, keepX = final.sipca$keepX)
# plot the samples of sPCA
plotIndiv(final.spca, comp = c(1, 2), ind.names = TRUE,
title = '(a) Liver Toxicity Genes, sPCA comp 1 - 2')
# plot the samples of sIPCA
plotIndiv(final.sipca, comp = c(1, 2), ind.names = TRUE,
title = '(b) Liver Toxicity Genes, sIPCA comp 1 - 2')
FIGURE 4: Sample plot from the sPCA (a) and sIPCA (b) performed on the liver toxicity gene data.
Sample projections onto the first two components from the equivalent sPCA and sIPCA outputs can be seen in Figure 4 (a) and (b) respectively. Seeing as there are multiple response variables in the liver.toxicity data, the samples were not coloured. The output from sPCA (a) was included to highlight the improved clustering seen in sIPCA (b) on this sort of data.
Variable Plots
# plot features against the sIPCA components
plotVar(final.sipca, comp = c(1, 2), var.names = FALSE,
title = 'Liver Toxicity Genes, sIPCA comp 1 - 2')
FIGURE 5: Correlation circle plot from the sIPCA performed on the Liver Toxicity gene data. Only the genes selected by the sIPCA are shown on this plot.
The correlation circle plot can be seen in Figure 5. It shows clustering of the gene expression features; there appear to be about four clusters. From Figure 5, it can be observed that all the selected features were large contributors to the first two Independent Principal Components.
More information on Plots
For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages:
Additional Notes
Kurtosis
The kurtosis measure is used to order the Independent Principal Components (IPCs). A value of zero is consistent with a Gaussian distribution, and the larger the magnitude, the greater the deviation from a Gaussian distribution. Greater deviations are desirable because IPCA seeks non-Gaussian components.
It has been shown that the kurtosis value is a good post-hoc indicator of the number of components to choose, as a sudden drop in the values corresponds to irrelevant components.
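To make the measure concrete, the sketch below computes one common definition, the excess kurtosis (standardised fourth moment minus three), on simulated samples in base R. The function name excess_kurtosis() is hypothetical, and the exact estimator used inside mixOmics is not verified here, so values may differ slightly from those stored in the model object.

```r
# Excess kurtosis: standardised fourth moment minus 3 (zero for a Gaussian)
excess_kurtosis <- function(v) {
  centred <- v - mean(v)
  z <- centred / sqrt(mean(centred^2))
  mean(z^4) - 3
}

set.seed(123)
k_gauss <- excess_kurtosis(rnorm(1e5))  # Gaussian sample: close to zero
k_heavy <- excess_kurtosis(rexp(1e5))   # skewed, heavy-tailed sample: clearly positive
print(c(gaussian = k_gauss, exponential = k_heavy))
```

The Gaussian sample gives a value near zero, while the exponential sample gives a clearly positive value, which is the kind of deviation that IPCA looks for in a component.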
Davies Bouldin Index
This value is the ratio of the intracluster (within-cluster) scatter to the intercluster (between-cluster) separation. Low values indicate good clustering, meaning that points within each cluster are tightly grouped and different clusters are well separated from one another.
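The ratio can be written out in a few lines of base R. This is a minimal sketch under stated assumptions, not the clusterSim implementation: davies_bouldin() is a hypothetical helper, scatter is taken as the mean Euclidean distance of points to their cluster centroid, separation is the Euclidean distance between centroids, and the cluster labels are assumed to be known. Other definitions of scatter would give different values.

```r
# Hypothetical helper: Davies Bouldin index for a score matrix and known cluster labels
davies_bouldin <- function(scores, cl) {
  scores <- as.matrix(scores)
  cl <- droplevels(as.factor(cl))
  groups <- levels(cl)
  k <- length(groups)
  stopifnot(k >= 2, nrow(scores) == length(cl))
  # one centroid per cluster (k rows)
  centroids <- do.call(rbind, lapply(groups, function(g) {
    colMeans(scores[cl == g, , drop = FALSE])
  }))
  # within-cluster scatter: mean distance of points to their centroid
  scatter <- sapply(seq_len(k), function(i) {
    pts <- scores[cl == groups[i], , drop = FALSE]
    mean(sqrt(rowSums(sweep(pts, 2, centroids[i, ])^2)))
  })
  sep <- as.matrix(dist(centroids))           # between-centroid distances
  ratio <- outer(scatter, scatter, "+") / sep
  diag(ratio) <- -Inf                         # ignore a cluster compared with itself
  mean(apply(ratio, 1, max))                  # average worst-case ratio per cluster
}

set.seed(7)
base_pts <- matrix(rnorm(60 * 2), ncol = 2)
cl <- rep(1:2, each = 30)
far <- base_pts;  far[cl == 2, ] <- far[cl == 2, ] + 10    # well-separated clusters
near <- base_pts; near[cl == 2, ] <- near[cl == 2, ] + 0.5 # overlapping clusters
db_far <- davies_bouldin(far, cl)
db_near <- davies_bouldin(near, cl)
print(c(far = db_far, near = db_near))
```

The well-separated configuration yields the lower index. For manual tuning, the same calculation could be repeated on the sample projections obtained for each candidate keepX, keeping the value that gives the lowest index.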
References
- Bushel, P.R., Wolfinger, R.D. and Gibson, G., 2007. Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology, 1(1), p.1.
- Yao, F., Coquery, J. and Lê Cao, K.-A., 2012. Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC Bioinformatics, 13, p.24. doi:10.1186/1471-2105-13-24.