# Case study with PCA and sPLS on Liver Toxicity data set

The sparse Partial Least Squares (sPLS) approach implemented in **mixOmics** combines both integration and variable selection simultaneously on two data sets in a one-step strategy.

sPLS overcomes issues associated with PCA and CCA, which are often limited by the characteristics of the ‘omics data, see sPLS.

Here, we illustrate PCA and sPLS using the liver toxicity data set, see *?liver.toxicity*. The data set contains the expression measure of 3116 genes and 10 clinical measurements for 64 subjects (rats) that were exposed to non-toxic, moderately toxic or severely toxic doses of acetaminophen in a controlled experiment, see study details in [1].

## To begin…

Load the latest version of mixOmics.

```
library(mixOmics)
```

# Data

The liver toxicity data set is implemented in mixOmics via *liver.toxicity*, and contains the following:

**$gene:** data frame with 64 rows and 3116 columns. The expression measure of 3116 genes for the 64 subjects (rats).

**$clinic:** data frame with 64 rows and 10 columns, containing 10 clinical variables for the same 64 subjects.

**$treatment:** data frame with 64 rows and 4 columns, containing information on the treatment of the 64 subjects, such as doses of acetaminophen and times of necropsy.

```
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# check that the subjects are matched in the two data sets
head(cbind(rownames(X), rownames(Y)))
```

## [,1] [,2] ## [1,] "ID202" "202" ## [2,] "ID203" "203" ## [3,] "ID204" "204" ## [4,] "ID206" "206" ## [5,] "ID208" "208" ## [6,] "ID209" "209"

We analyze these two data sets (genes and clinical measurements) using sPLS with a **regression mode** , in an attempt to explain/predict the clinical variables with respect to the gene expression levels:

# Preliminary analysis with PCA

Start a preliminary investigation with PCA analysis on the liver toxicity data. PCA is an unsupervised approach (eg. no information about the treatment is input in PCA). We perform two PCAs, one for the gene expression and one for the clinical variables.

```
pca.gene <- pca(X, ncomp = 10, center = TRUE, scale = TRUE)
pca.gene
```

## Eigenvalues for the first 10 principal components, see object$sdev^2: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## 874.77885 463.10098 248.15205 167.57834 141.95703 122.32390 108.76499 ## PC8 PC9 PC10 ## 77.74595 74.18373 56.45566 ## ## Proportion of explained variance for the first 10 principal components, see object$explained_variance: ## PC1 PC2 PC3 PC4 PC5 PC6 ## 0.28073776 0.14862034 0.07963801 0.05377996 0.04555745 0.03925671 ## PC7 PC8 PC9 PC10 ## 0.03490533 0.02495056 0.02380736 0.01811799 ## ## Cumulative proportion explained variance for the first 10 principal components, see object$cum.var: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## 0.2807378 0.4293581 0.5089961 0.5627761 0.6083335 0.6475902 0.6824956 ## PC8 PC9 PC10 ## 0.7074461 0.7312535 0.7493715 ## ## Other available components: ## -------------------- ## loading vectors: see object$rotation

```
plot(pca.gene)
```

The PCA numerical output shows that 50% of the total variance is explained with 3 principal components. The barplot above shows the variance explained per component.

Note that it is preferable to first run a PCA with a large number of components (e.g. 10), then visualize on the barplot when the ‘elbow’ (sudden drop) appear to choose the final number of PCs.

```
pca.clinical <- pca(Y, ncomp = 10, center = TRUE, scale = TRUE)
pca.clinical
```

## Eigenvalues for the first 10 principal components, see object$sdev^2: ## PC1 PC2 PC3 PC4 PC5 PC6 ## 4.083260500 2.219951947 1.247262301 0.894716350 0.675342967 0.483596842 ## PC7 PC8 PC9 PC10 ## 0.261923034 0.089045216 0.036440620 0.008460224 ## ## Proportion of explained variance for the first 10 principal components, see object$explained_variance: ## PC1 PC2 PC3 PC4 PC5 ## 0.4083260500 0.2219951947 0.1247262301 0.0894716350 0.0675342967 ## PC6 PC7 PC8 PC9 PC10 ## 0.0483596842 0.0261923034 0.0089045216 0.0036440620 0.0008460224 ## ## Cumulative proportion explained variance for the first 10 principal components, see object$cum.var: ## PC1 PC2 PC3 PC4 PC5 PC6 PC7 ## 0.4083260 0.6303212 0.7550475 0.8445191 0.9120534 0.9604131 0.9866054 ## PC8 PC9 PC10 ## 0.9955099 0.9991540 1.0000000 ## ## Other available components: ## -------------------- ## loading vectors: see object$rotation

The PCA numerical output shows that 75% of the total variance is explained with 3 principal components.

# Sample Plots

```
plotIndiv(pca.gene, comp = c(1, 2), group = liver.toxicity$treatment[, 4],
ind.names = liver.toxicity$treatment[, 3],
legend = TRUE, title = 'Liver gene, PCA comp 1 - 2')
```

```
plotIndiv(pca.clinical, comp = c(1, 2), group = liver.toxicity$treatment[, 4],
ind.names = liver.toxicity$treatment[, 3],
legend = TRUE,title = 'Liver clinical, PCA comp 1 - 2')
```

# sPLS analysis

## Choosing the number of components…

sPLS is implemented in **mixOmics** with four modes, including *regression mode*, see [2], and *canonical mode* similar to a Canonical Correlation Analysis, see [3].

Here we analyze these two data sets (X and Y) using sPLS with a *regression mode* in an attempt to explain/predict the clinical variables with respect to the gene expression levels. First we use PLS and then follow with the sparse variation sPLS for comparison. sPLS combines both integration and variable selection, in a one-step strategy to maximize the co variance between two data sets and to identify latent variables. in sPLS the number of variables to select on each component need to be specified by the arguments *keepX* and *keepY*. Here *keepX* and *keepY* selects 10 genes on each sPLS component.

The first step of both analyses is to tune the key parameters. To do this we currently use the function perf(). **Note** while canonical mode is currently implemented, it does not yet have a *tune()* or *perf()* function.

```
liver.pls <- pls(X, Y, ncomp = 10, mode = "regression")
liver.spls <- spls(X, Y, ncomp =10, keepX = c(10,10,10), keepY= c(10,10,10), mode = "regression")
```

We use *perf()* using a 10-fold crossvalidation and *nrepeat = 10*. **Note:** it is recommended to set *nrepeat* to 50-100 and the code below is to present an example only.

```
tune.pls <- perf(liver.pls, validation = "Mfold", folds = 10, progressBar = FALSE, nrepeat = 10)
tune.spls <- perf(liver.spls, validation = "Mfold", folds = 10, progressBar = FALSE, nrepeat = 10)
```

```
plot(tune.pls$Q2.total)
abline(h = 0.0975)
```

```
plot(tune.spls$Q2.total)
abline(h = 0.0975)
```

The *Q2.total* can be used to tune the number of components using the perf() function, see [2]. The rule of thumbs is that a PLS component should be included in the model if its value is less than or equal to 0.0975. Therefore, the optimal number of component to choose is

indicated once the Q2 is below the threshold of 0.0975 [5].

```
tune.pls$Q2.total
```

## Q2.total ## 1 comp 0.269893996 ## 2 comp 0.042761376 ## 3 comp 0.089266289 ## 4 comp 0.034924304 ## 5 comp 0.017287067 ## 6 comp -0.044587808 ## 7 comp 0.028246189 ## 8 comp -0.062142764 ## 9 comp -0.028860451 ## 10 comp 0.009730501

```
tune.spls$Q2.total
```

## Q2.total ## 1 comp 0.28546834 ## 2 comp 0.14803353 ## 3 comp 0.02026250 ## 4 comp -0.01623416 ## 5 comp 0.02040158 ## 6 comp -0.08745901 ## 7 comp -0.01120717 ## 8 comp -0.01796435 ## 9 comp -0.03120767 ## 10 comp -0.01724609

As shown above the Q2 criterion indicates that 2 components would be sufficient in the model. Other evaluation criteria include R2 and MSEP.

Similarly with the ‘R2’ (coefficient of multiple determination), we would expect it to be better using sPLS than PLS.

# Sample Plots

We can represent the samples projected onto the sPLS components using the *plotIndiv()* function.

```
par(mfrow = c(1,3))
plotIndiv(liver.spls, comp = 1:2, rep.space= 'Y-variate', group = liver.toxicity$treatment[, 4],
ind.names = liver.toxicity$treatment[, 3],
legend = TRUE, title = 'Liver, sPLS comp 1 - 2, Y-space')
```

```
plotIndiv(liver.spls, comp = 1:2, rep.space= 'X-variate', group = liver.toxicity$treatment[, 4],
ind.names = liver.toxicity$treatment[, 3],
legend = TRUE, title = 'Liver, sPLS comp 1 - 2, X-space')
```

```
plotIndiv(liver.spls, comp = 1:2, rep.space= 'XY-variate', group = liver.toxicity$treatment[, 4],
ind.names = liver.toxicity$treatment[, 3],
legend = TRUE, title = 'Liver, sPLS comp 1 - 2, XY-space')
```

The *plotIndiv* outputs above highlights the patterns in the two data sets (gene and clinical). Individual plots can be displayed on three different subspaces spanned either by the X variable, the Y variable or the mean subspace in which coordinates are averaged from the first two subspaces (XY). This output shows that the time of necropsy has a larger effect than the acetaminophen doses the rat eat. Compared with individual variable PCA plots above, sPLS differentiates between the low and higher doses of acetaminophen and time of nercropsy on each dimension.

While sPLS is an unsupervised approach and does not take into account the classes of the samples in the model, the graphs above are colored according to the time of necropsy to better understand the similarities between samples.

The plots can also be represented in 3D using *style = 3D*.

```
col.tox <- color.mixo(as.numeric(as.factor(liver.toxicity$treatment[, 4])))
plotIndiv(liver.spls, ind.names = F, axes.box = "both", col = col.tox, style = '3d')
```

# Variable Plots

The variables selected by the sPLS are projected onto a correlation circle plot using *plotVar*. More information on how to interpret the correlation circle plots can be found in [4].

```
plotVar(liver.spls, comp =1:2,
var.names = list(X.label = liver.toxicity$gene.ID[,'geneBank'],
Y.label = TRUE), cex = c(4, 5))
```

**Relevance Networks** and **Clustered Image Map (CIM)** can also help in the interpretation of the results. The network takes the attribute *keep.var=TRUE* to display only the variables selected by sPLS:

```
# define red and green colors for the edges
color.edge <- color.GreenRed(50)
# to save as a pdf
network(liver.spls, comp = 1:2, shape.node = c("rectangle", "rectangle"),
color.node = c("white", "pink"), color.edge = color.edge, threshold = 0.8)
```

This network can be saved as a .glm for an input into Cytoscape, see here and [4].

The CIM also allows to visualize correlations between variables. Here we represent the variables selected on the 3 components, see [4].

```
cim(liver.spls, comp = 1:3, xlab = "clinic", ylab = "genes",
margins = c(7, 7))
```

# References

- Bushel, P.R., Wolfinger, R.D. and Gibson, G., 2007. Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Systems Biology, 1(1), p.1.
- Lê Cao, K.A., Rossouw, D., Robert-Granié, C. and Besse, P., 2008. A sparse PLS for variable selection when integrating omics data. Statistical applications in genetics and molecular biology, 7(1).
- Lê Cao K.-A., Martin P.G.P., Robert-Granié C. and Besse P. (2009) Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 10(34).
- González, I., Lê Cao, K.A., Davis, M.J. and Déjean, S., 2012. Visualising associations between paired ‘omics’ data sets. BioData mining, 5(1), p.1.
- Tenenhaus, M. (1998). La régression PLS: théorie et pratique. Editions Technip.