Independant Principle Component Analysis (IPCA) is an extremely useful tool when standard PCA is limited, usually induced by large quantites of noise in the dataset being analysed. This case study will draw many parallels to the process in the PCA Multidrug Case Study. Choosing IPCA over PCA requires an understanding of the variation structure of the data under inspection.
Load the latest version of mixOmics:
library(mixOmics)
The liver toxicity dataset was generated in a study in which rats were subjected to varying levels of acetaminophen (Bushel et al., 2007).
The mixOmics
liver toxicity dataset is accessed via
liver.toxicity
and contains the following:
liver.toxicity$gene
(continuous matrix): 64 rows and
3116 columns. The expression measure of 3116 genes for the 64 subjects
(rats).
liver.toxicity$clinic
(continuous matrix): 64 rows
and 10 columns, containing 10 clinical variables for the same 64
subjects.
liver.toxicity$treatment
(continuous/categorical
matrix): 64 rows and 4 columns, containing information on the treatment
of the 64 subjects, such as doses of acetaminophen and times of
necropsy.
To confirm the correct dataframe was extract, the dimensions are checked:
data(liver.toxicity) # call liver toxicity dataset
X <- liver.toxicity$gene # extract gene expression data
dim(X) # confirm the dimension of data
## [1] 64 3116
One of the primary reasons to use IPCA rather than PCA is when the data does not follow a Gaussian distribution. Figure 1 shows that when undergoing a Shapiro-Wilk test (the null hypothesis being that the data is distributed normally), nearly half of the 3116 features are unlikely to be distributed in a Gaussian way. This would violate the assumption of normality PCA has, hence making IPCA a more appropriate choice.
p.values = c() # initialise empty list to carry p-values
# for each column of the dataframe, undergo a shapiro-wilk test
# and extract the p-value
for (col in 1:dim(X)[2]) {
p.values[col] <- shapiro.test(X[, col])$p.value
}
h <- hist(p.values, breaks = 20, plot = FALSE) # create the histogram object
h$density <- h$counts/sum(h$counts) # adjust the frequency to a density
plot(h,freq=FALSE, ylim = c(0, 0.5)) # plot the histogram
FIGURE 1: Histogram depicting the distribution of p-values generated by running a Shaprio-Wilk test on each feature in the liver toxicity gene dataset. The large value <0.05 indicates that there are many features which are significanly unlikely to be distributed normally.
As in the sPCA Case Study, a basic model is formed first to inspect the data prior to any sort of tuning. This will aid in the tuning process down the line.
toxic.sipca <- sipca(X, ncomp = 10) # run preliminary model
plot(toxic.sipca) # plot the explained variance per component
FIGURE 2: Explained variance of Independent Principal Components on the Liver Toxicity Gene data
Note that Figure 2 is not a strictly decreasing plot (as in the equivalent plot in the PCA Multidrug Case Study). Components are not generated based on maximisation of explained variance but maximal reduction in noise. It seems from Figure 2 that the first component explains a high proportion of the variance observed in the data.
By default, the data will not be scaled at all
(scale = FALSE
). Generally, it is advised that scaling to
unit variance be done with the data, especially in cases where there is
inhomogeneity in variance across the variables.
The number of IPCs to select is an open issue. Rather than using the
explained variance, the Kurtosis value can be used. The ‘elbow’ method
is still applicable on the Kurtosis values. This is accessed via
ipca.Object$kurtosis
. The kurtosis value is described in
more detail below. Figure 3 shows that three components is an
appropriate choice.
barplot(toxic.sipca$kurtosis, ylim = c(0, 32),
names.arg = seq(1, 10, 1),
xlab = "Independent Principal Components",
ylab = "Kurtosis value")
FIGURE 3: Kurtosis values of Independent Principal Components on the Liver Toxicity Gene data
sipca()
uses the same keepX
parameter as
spca()
resulting in the same difficulty in selecting the
number of variables to be used for component construction. The
Davies Bouldin index is a measure that has been used to
optimise this decision (Yao et al., 2012). User experimentation with
different values for keepX
and their resulting Davies
Bouldin index is recommended. This index is also explained in depth
further below.
As there is no implementation of a tune.sipca()
function
within mixOmics
which utilises this measure, it would need
to be done manually. It is recommended to use the
index.DB()
function from the clusterSim
package. For this case study, 50 features will be used for each
component.
Using these tuned parameters, the final model can now be run to yield an optimised sIPCA visualisation.
# based off figure 3, three components is best
# using the default keepX, c(50, 50, 50) in this case
final.sipca <- sipca(X, ncomp = 3)
# generate sPCA on same dataset
final.spca <- spca(X, ncomp = 3, scale = TRUE, keepX = final.sipca$keepX)
# plot the samples of sPCA
plotIndiv(final.spca, comp = c(1, 2), ind.names = TRUE,
title = '(a) Liver Toxicity Genes, sPCA comp 1 - 2')
# plot the samples of sIPCA
plotIndiv(final.sipca, comp = c(1, 2), ind.names = TRUE,
title = '(b) Liver Toxicity Genes, sIPCA comp 1 - 2')
FIGURE 4: Sample plot from the sPCA (a) and sIPCA (b) performed on the liver toxicity gene data.
Sample projections onto the first two components from the equivalent
sPCA and sIPCA outputs can be seen in Figure 4 (a) and (b) respectively.
Seeing as are multiple response variables in the
liver.toxicity
data, the samples were not coloured. The
output from sPCA (a) was included to depict the improved clustering seen
in sIPCA (b) on this sort of data.
# plot features against the sIPCA components
plotVar(final.sipca, comp = c(1, 2), var.names = FALSE,
title = 'Liver Toxicity Genes, sIPCA comp 1 - 2')
FIGURE 5: Correlation circle plot from the sIPCA performed on the Liver Toxicity genedata. Only the gene genes selected by the sIPCA are shown on this plot.
The correlation circle plot can be seen in Figure 5. Clustering of the gene expression features are shown - it seems as if there are about four clusters. From Figure 5, it can be observed that all the selected features were large contributors to the first two Independent Principal Components.