# Case Study of sIPCA with Liver Toxicity dataset

Independant Principle Component Analysis (IPCA) is an extremely useful tool when standard PCA is limited, usually induced by large quantites of noise in the dataset being analysed. This case study will draw many parallels to the process in the PCA Multidrug Case Study. Choosing IPCA over PCA requires an understanding of the variation structure of the data under inspection.

For background information on the (s)IPCA method, refer to the IPCA Methods Page.

# Rscript

The R script used for all the analysis in this case study is available here.

## To begin

Load the latest version of mixOmics:

```
library(mixOmics)
```

### The data

The liver toxicity dataset was generated in a study in which rats were subjected to varying levels of acetaminophen (Bushel et al., 2007).

The `mixOmics`

liver toxicity dataset is accessed via `liver.toxicity`

and contains the following:

`liver.toxicity$gene`

(continuous matrix): 64 rows and 3116 columns. The expression measure of 3116 genes for the 64 subjects (rats).`liver.toxicity$clinic`

(continuous matrix): 64 rows and 10 columns, containing 10 clinical variables for the same 64 subjects.`liver.toxicity$treatment`

(continuous/categorical matrix): 64 rows and 4 columns, containing information on the treatment of the 64 subjects, such as doses of acetaminophen and times of necropsy.

To confirm the correct dataframe was extract, the dimensions are checked:

```
data(liver.toxicity) # call liver toxicity dataset
X <- liver.toxicity$gene # extract gene expression data
dim(X) # confirm the dimension of data
```

## [1] 64 3116

## When to choose IPCA over PCA

One of the primary reasons to use IPCA rather than PCA is when the data does not follow a Gaussian distribution. Figure 1 shows that when undergoing a Shapiro-Wilk test (the null hypothesis being that the data is distributed normally), nearly half of the 3116 features are unlikely to be distributed in a Gaussian way. This would violate the assumption of normality PCA has, hence making IPCA a more appropriate choice.

```
p.values = c() # initialise empty list to carry p-values
# for each column of the dataframe, undergo a shapiro-wilk test
# and extract the p-value
for (col in 1:dim(X)[2]) {
p.values[col] <- shapiro.test(X[, col])$p.value
}
h <- hist(p.values, breaks = 20, plot = FALSE) # create the histogram object
h$density <- h$counts/sum(h$counts) # adjust the frequency to a density
plot(h,freq=FALSE, ylim = c(0, 0.5)) # plot the histogram
```

FIGURE 1: Histogram depicting the distribution of p-values generated by running a Shaprio-Wilk test on each feature in the liver toxicity gene dataset. The large value <0.05 indicates that there are many features which are significanly unlikely to be distributed normally.

## Initial Analysis

As in the sPCA Case Study, a basic model is formed first to inspect the data prior to any sort of tuning. This will aid in the tuning process down the line.

```
toxic.sipca <- sipca(X, ncomp = 10) # run preliminary model
plot(toxic.sipca) # plot the explained variance per component
```

FIGURE 2: Explained variance of Independent Principal Components on the Liver Toxicity Gene data

Note that Figure 2 is not a strictly decreasing plot (as in the equivalent plot in the PCA Multidrug Case Study). Components are not generated based on maximisation of explained variance but maximal reduction in noise. It seems from Figure 2 that the first component explains a high proportion of the variance observed in the data.

## Tuning sIPCA

### Scaling the data

#### The **scale** Parameters

By default, the data will not be scaled at all (`scale = FALSE`

). Generally, it is advised that scaling to unit variance be done with the data, especially in cases where there is inhomogeneity in variance across the variables.

### Selecting the number of components

#### The **ncomp** Parameter

The number of IPCs to select is an open issue. Rather than using the explained variance, the Kurtosis value can be used. The ‘elbow’ method is still applicable on the Kurtosis values. This is accessed via `ipca.Object$kurtosis`

. The kurtosis value is described in more detail below. Figure 3 shows that three components is an appropriate choice.

```
barplot(toxic.sipca$kurtosis, ylim = c(0, 32),
names.arg = seq(1, 10, 1),
xlab = "Independent Principal Components",
ylab = "Kurtosis value")
```

FIGURE 3: Kurtosis values of Independent Principal Components on the Liver Toxicity Gene data

### Selecting the number of variables

#### The **keepX** Parameter

`sipca()`

uses the same `keepX`

parameter as `spca()`

resulting in the same difficulty in selecting the number of variables to be used for component construction. The *Davies Bouldin* index is a measure that has been used to optimise this decision (Yao et al., 2012). User experimentation with different values for `keepX`

and their resulting Davies Bouldin index is recommended. This index is also explained in depth further below.

As there is no implementation of a `tune.sipca()`

function within `mixOmics`

which utilises this measure, it would need to be done manually. It is recommended to use the `index.DB()`

function from the `clusterSim`

package. For this case study, 50 features will be used for each component.

## Final Model

Using these tuned parameters, the final model can now be run to yield an optimised sIPCA visualisation.

```
# based off figure 3, three components is best
# using the default keepX, c(50, 50, 50) in this case
final.sipca <- sipca(X, ncomp = 3)
```

## Plots

### Sample Plots

```
# generate sPCA on same dataset
final.spca <- spca(X, ncomp = 3, scale = TRUE, keepX = final.sipca$keepX)
# plot the samples of sPCA
plotIndiv(final.spca, comp = c(1, 2), ind.names = TRUE,
title = '(a) Liver Toxicity Genes, sPCA comp 1 - 2')
# plot the samples of sIPCA
plotIndiv(final.sipca, comp = c(1, 2), ind.names = TRUE,
title = '(b) Liver Toxicity Genes, sIPCA comp 1 - 2')
```

FIGURE 4: Sample plot from the sPCA (a) and sIPCA (b) performed on the liver toxicity gene data.

Sample projections onto the first two components from the equivalent sPCA and sIPCA outputs can be seen in Figure 4 (a) and (b) respectively. Seeing as are multiple response variables in the `liver.toxicity`

data, the samples were not coloured. The output from sPCA (a) was included to depict the improved clustering seen in sIPCA (b) on this sort of data.

### Variable Plots

```
# plot features against the sIPCA components
plotVar(final.sipca, comp = c(1, 2), var.names = FALSE,
title = 'Liver Toxicity Genes, sIPCA comp 1 - 2')
```

FIGURE 5: Correlation circle plot from the sIPCA performed on the Liver Toxicity genedata. Only the gene genes selected by the sIPCA are shown on this plot.

The correlation circle plot can be seen in Figure 5. Clustering of the gene expression features are shown – it seems as if there are about four clusters. From Figure 5, it can be observed that all the selected features were large contributors to the first two Independent Principal Components.

### More information on Plots

For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages:

## Addition Notes

### Kurtosis

The kurtosis measure is used to order the Independent Principal Components (IPCs). A value of zero indicates the variable has a Gaussian distribution. Increasing the magnitude indicates a greater deviation from a Gaussian distribution. Greater deviations are desirable due to the non-Gaussian nature of IPCA.

It has been shown that the kurtosis value is a good post-hoc indicator of the number of components to choose, as a sudden drop in the values corresponds to irrelevant components.

### Davies Bouldin Index

This value is the ratio of the intracluster (within-cluster) scatter and intercluster (between-cluster) scatter. Low values indicate good clustering, such that points within one cluster are tight and differing clusters are well-defined from one another.

## References

- Bushel, P.R., Wolfinger, R.D. and Gibson, G., 2007. Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology, 1(1), p.1.
- Yao, Fangzhou & Coquery, Jeff & Cao, Kim-Anh. (2012). Independent Principal Component Analysis for biologically meaningful dimension reduction of large biological data sets. BMC bioinformatics. 13. 24. 10.1186/1471-2105-13-24.