Case Study of sIPCA with Multidrug dataset

Independant Principle Component Analysis (IPCA) is an extremely useful tool when standard PCA is limited, usually induced by large quantites of noise in the dataset being analysed. This case study will draw many parallels to the process in the PCA Multidrug Case Study. Choosing IPCA over PCA requires an understanding of the variation structure of the data under inspection.

For background information on the (s)IPCA method, refer to the IPCA Methods Page.

Rscript

The R script used for all the analysis in this case study is available here.

To begin

library(mixOmics)

The data

The liver toxicity dataset was generated in a study in which rats were subjected to varying levels of acetaminophen (Bushel et al., 2007).

The mixOmics liver toxicity dataset is accessed via liver.toxicity and contains the following:

• liver.toxicity$gene (continuous matrix): 64 rows and 3116 columns. The expression measure of 3116 genes for the 64 subjects (rats). • liver.toxicity$clinic (continuous matrix): 64 rows and 10 columns, containing 10 clinical variables for the same 64 subjects.

• liver.toxicity$treatment (continuous/categorical matrix): 64 rows and 4 columns, containing information on the treatment of the 64 subjects, such as doses of acetaminophen and times of necropsy. To confirm the correct dataframe was extract, the dimensions are checked: data(liver.toxicity) # call liver toxicity dataset X <- liver.toxicity$gene # extract gene expression data
dim(X) # confirm the dimension of data
## [1]   64 3116

When to choose IPCA over PCA

One of the primary reasons to use IPCA rather than PCA is when the data does not follow a Gaussian distribution. Figure 1 shows that when undergoing a Shapiro-Wilk test (the null hypothesis being that the data is distributed normally), nearly half of the 3116 features are unlikely to be distributed in a Gaussian way. This would violate the assumption of normality PCA has, hence making IPCA a more appropriate choice.

p.values = c() # initialise empty list to carry p-values

# for each column of the dataframe, undergo a shapiro-wilk test
# and extract the p-value
for (col in 1:dim(X)[2]) {
p.values[col] <- shapiro.test(X[, col])$p.value } h <- hist(p.values, breaks = 20, plot = FALSE) # create the histogram object h$density <- h$counts/sum(h$counts) # adjust the frequency to a density
plot(h,freq=FALSE, ylim = c(0, 0.5)) # plot the histogram

FIGURE 1: Histogram depicting the distribution of p-values generated by running a Shaprio-Wilk test on each feature in the liver toxicity gene dataset. The large value <0.05 indicates that there are many features which are significanly unlikely to be distributed normally.

Initial Analysis

As in the sPCA Case Study, a basic model is formed first to inspect the data prior to any sort of tuning. This will aid in the tuning process down the line.

toxic.sipca <- sipca(X, ncomp = 10) # run preliminary model
plot(toxic.sipca) # plot the explained variance per component

FIGURE 2: Explained variance of Independent Principal Components on the Liver Toxicity Gene data

Note that Figure 2 is not a strictly decreasing plot (as in the equivalent plot in the PCA Multidrug Case Study). Components are not generated based on maximisation of explained variance but maximal reduction in noise. It seems from Figure 2 that the first component explains a high proportion of the variance observed in the data.

Tuning sIPCA

Scaling the data

The scale Parameters

By default, the data will not be scaled at all (scale = FALSE). Generally, it is advised that scaling to unit variance be done with the data, especially in cases where there is inhomogeneity in variance across the variables.

Selecting the number of components

The ncomp Parameter

The number of IPCs to select is an open issue. Rather than using the explained variance, the Kurtosis value can be used. The 'elbow' method is still applicable on the Kurtosis values. This is accessed via ipca.Object$kurtosis. The kurtosis value is described in more detail below. Figure 3 shows that three components is an appropriate choice. barplot(toxic.sipca$kurtosis, ylim = c(0, 32),
names.arg = seq(1, 10, 1),
xlab = "Independent Principal Components",
ylab = "Kurtosis value")

FIGURE 3: Kurtosis values of Independent Principal Components on the Liver Toxicity Gene data

Selecting the number of variables

The keepX Parameter

sipca() uses the same keepX parameter as spca() resulting in the same difficulty in selecting the number of variables to be used for component construction. The Davies Bouldin index is a measure that has been used to optimise this decision (Yao et al., 2012). User experimentation with different values for keepX and their resulting Davies Bouldin index is recommended. This index is also explained in depth further below.

As there is no implementation of a tune.sipca() function within mixOmicswhich utilises this measure, it would need to be done manually. It is recommended to use the index.DB() function from the clusterSim package. For this case study, 50 features will be used for each component.

Final Model

Using these tuned parameters, the final model can now be run to yield an optimised sIPCA visualisation.

# based off figure 3, three components is best
# using the default keepX, c(50, 50, 50) in this case
final.sipca <- sipca(X, ncomp = 3)

Plots

Sample Plots

# generate sPCA on same dataset
final.spca <- spca(X, ncomp = 3, scale = TRUE, keepX = final.sipca\$keepX)

# plot the samples of sPCA
plotIndiv(final.spca, comp = c(1, 2), ind.names = TRUE,
title = '(a) Liver Toxicity Genes, sPCA comp 1 - 2')

# plot the samples of sIPCA
plotIndiv(final.sipca, comp = c(1, 2), ind.names = TRUE,
title = '(b) Liver Toxicity Genes, sIPCA comp 1 - 2')

FIGURE 4: Sample plot from the sPCA (a) and sIPCA (b) performed on the liver toxicity gene data.

Sample projections onto the first two components from the equivalent sPCA and sIPCA outputs can be seen in Figure 4 (a) and (b) respectively. Seeing as are multiple response variables in the liver.toxicity data, the samples were not coloured. The output from sPCA (a) was included to depict the improved clustering seen in sIPCA (b) on this sort of data.

Variable Plots

# plot features against the sIPCA components
plotVar(final.sipca, comp = c(1, 2), var.names = FALSE,
title = 'Liver Toxicity Genes, sIPCA comp 1 - 2')

FIGURE 5: Correlation circle plot from the sIPCA performed on the Liver Toxicity genedata. Only the gene genes selected by the sIPCA are shown on this plot.

The correlation circle plot can be seen in Figure 5. Clustering of the gene expression features are shown – it seems as if there are about four clusters. From Figure 5, it can be observed that all the selected features were large contributors to the first two Independent Principal Components.

For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages: