Single omics supervised multivariate analyses

Here, we illustrate PCA and sPLS-DA using the SRBCT data set, see ?srbct.

library(mixOmics)


Rscript

The R script is available at this link

Data

The data are directly available in a processed and normalised format from the package. The Small Round Blue Cell Tumors (SRBCT) dataset from [1] includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).

The srbct dataset contains the following:

$gene: a data frame with 63 rows and 2308 columns. The expression levels of 2,308 genes in 63 subjects.$class: a class vector containing the class tumor of each individual (4 classes in total).

$gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes. data(srbct) X = srbct$gene  #the gene expression data
dim(X)

## [1]   63 2308

summary(srbct$class)  ## EWS BL NB RMS ## 23 8 12 20  Principal Component Analysis A preliminary PCA analysis on the gene expression data allows a first exploration of the major sources of variation in the data. Recall that PCA is an unsupervised analysis where no information about the tumour classes is provided in the method. In order to understand the amount of variation explained, we set ncomp to a rather large number. In PCA, centering is a recommended transformation in most situations [2] and results in all genes with the same zero mean, allowing to focus on the differences between samples. Scaling generally aims to give similar weights to all genes in the analysis, since genes with high variance will be considered influential in PCA but are not necessarily of biological relevance. pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE) #pca.srbct #outputs the explained variance per component plot(pca.srbct) # screeplot of the eingenvalues (explained variance per component)  The barplot above shows that two components are sufficient to explain most variance (or information) from the data. In the following sample plot, samples are represented on the first two principal components and colored according to the tumour type. Here we observe that the major source of variation may not be explained by tumour types. Note that since PCA is unsupervised, we only take into account the sample type information after the PCA, for visualisation purposes. plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE,
legend = TRUE, title = 'PCA on SRBCT')


PLS-DA analysis

For a discriminant analysis, we set the factor Y that indicates the class membership of each sample. Inside the PLS-DA procedure the Y factor will be transformed into a dummy matrix.

Y = srbct$class summary(Y) #outcome categories  ## EWS BL NB RMS ## 23 8 12 20  A PLS-DA model is fitted with ten components to evaluate the performance and the number of components necessary for the final model (see below). ##Sample plots The samples are then projected into the subspace spanned by the first two components. srbct.plsda <- plsda(X, Y, ncomp = 10) # set ncomp to 10 for performance assessment later plotIndiv(srbct.plsda , comp = 1:2, group = srbct$class, ind.names = FALSE,
ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT')


From the sample plot, we observe a clear separation of the four tumor classes compared to an unsupervised PCA sample plot. Confidence ellipses for each class are plotted to highlight the strength of the discrimination (confidence level set to 95% by default, see argument ellipse.level).

As detailed in our main article and supplemental information [3] the prediction area can be visualised by calculating a background surface first, before overlaying the sample plot. See ?background.predict for more details. The prediction distance needs to be specified:

# with background
background = background.predict(srbct.plsda, comp.predicted=2, dist = "max.dist")
#optional: xlim = c(-40,40), ylim = c(-30,30))

plotIndiv(srbct.plsda, comp = 1:2,
group = srbct$class, ind.names = FALSE, title = "Maximum distance", legend = TRUE, background = background)  Performance The classification performance of the PLS-DA model is assessed with the perf function using 5-fold cross-validation repeated 10 times. The number of repeats is necessary to ensure a good estimate of the classification error rate (as the CV-folds are determined in a random manner). From the performance results we can decide on the choice of the number of components of the final PLS model. The maximal number of components assessed corresponds to the number of components of the srbct.plsda model run above. # takes a couple of minutes to run set.seed(2543) # for reproducibility, only when the cpus' argument is not used perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 5, progressBar = FALSE, auc = TRUE, nrepeat = 10)  # perf.plsda.srbct$error.rate  # error rates
plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")


From the performance plot, we observe that the overall error rate and the Balanced Error Rate (BER) are similar, and sharply decrease from 1 to 3 components. The error rates stabilise after 6 components. The perf function outputs the optimal number of components in perf.plsda.srbct$choice.ncomp based on t-tests that test for a significant difference in the mean error rate between components. The result is output for each performance measure and distance, when applicable. The performance displayed suggests that *ncomp = * 6 with BER and the maximum distance is sufficient to achieve good performance (0.06 error rate). An AUC plot can also be obtained using the function auroc, where the AUC is calculated from training cross-validation sets and averaged (see the perf function outputs for each component, perf.plsda.srbct$auc and perf.plsda.srbct$auc.all for one vs. one class or one vs. all classes). Note however that ROC and AUC criteria may not be particularly insightful, or be in agreement with the PLSDA performance, as the prediction threshold in PLS-DA is based on specified distance as described in the main manuscript [3]. auc.plsda = auroc(srbct.plsda, roc.comp = 6)  #sPLS-DA analysis The PLS-DA model is built on all genes in X, many of which may be uninformative to characterise the different classes. The sPLS-DA analysis aims to identify a small subset of genes that best discriminate the classes. Tuning sPLS-DA We first estimate the classification performance (error rate) with respect to the number of selected variables in the model with the function tune.splsda. The tuning is performed one component at a time and we set a maximum of ncomp = 6 as suggested from the PLS-DA performance assessment. We chose 5-fold cross validation (folds = 5) repeated 10 times, and specify a prediction distance (maximal distance) to predict class membership across all CV runs. The code below may takes a few minutes to run on a mid-range laptop over a grid of keepX values that indicate the number of variables to select on each component. Some warnings may appear for some of the runs, as we set a stringent tolerance threshold in the algorithm. The argument cpus can be used to run the code in parallel and speed up computational time. Here we have saved the results into a RData object that is available for download. The tuning takes about 30 min to run. # grid of possible keepX values that will be tested for each component list.keepX <- c(1:10, seq(20, 300, 10)) tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 6, validation = 'Mfold', folds = 5, progressBar = TRUE, dist = 'max.dist', measure = "BER", test.keepX = list.keepX, nrepeat = 10, cpus = 2)  error <- tune.splsda.srbct$error.rate  # error rate per component for the keepX grid


The number of optimal components to choose is returned in **$choice.ncomp$ncomp}:

ncomp <- tune.splsda.srbct$choice.ncomp$ncomp # optimal number of components based on t-tests
ncomp

## [1] 3


The number of features to select on each component is output in $choice.keepX: select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]  # optimal number of variables to select
select.keepX

## comp1 comp2 comp3
##     9   280    30


The classification error rates for each component conditional on the last component are represented below, for all components specified in the tune function.

plot(tune.splsda.srbct, col = color.jet(6))


The classification error rate decreases when more components are included in sPLS-DA (the lower the prediction accuracy the better). The optimal number of variables to select that led to the best performance for each component is indicated as a diamond. A number of 3 components is sufficient for our final sPLS-DA model to reach optimal performance.

##Final model and sample representation
Our final model includes 3 components and 9, 280, 30 selected variables on the first 3 components.

splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)


The sample plots on the first three components (see below) show that the BL tumours are well separated on the first component, while the second component discriminated EWB from NB and RMS.

plotIndiv(splsda.srbct, comp = c(1,2),
group = srbct$class, ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'sPLS-DA on SRBCT, comp 1 & 2')  The addition of the third component further discriminates NB from RMS: plotIndiv(splsda.srbct, comp = c(1,3), group = srbct$class, ind.names = FALSE,
ellipse = TRUE, legend = TRUE,
title = 'sPLS-DA on SRBCT, comp 1 & 3')


An AUC plot can also be obtained using the function **auroc}, as for the PLS-DA analysis.
The first AUROC includes 2 components only:

auc.splsda = auroc(splsda.srbct, roc.comp = 2)


The AUROC including all components from our final model led to a perfect discrimination. Refer to [3] for the interpretation of such output as the ROC and AUC criteria are not particularly insightful in relation to the performance evaluation of our methods, but can complement the statistical analysis.

auc.splsda = auroc(splsda.srbct, roc.comp = ncomp)


Performance assessment

The classification performance of the final sPLS-DA model is assessed with the perf function by specifying a prediction distance.

set.seed(40) # for reproducibility, only when the cpus' argument is not used
# takes about 1 min to run
perf.srbct <- perf(splsda.srbct, validation = "Mfold", folds = 5,
dist = 'max.dist', nrepeat = 10,
progressBar = FALSE)


We observe that the overall and Balanced Error Rate (BER) per class decrease as more components (3) are added in the model.

# perf.srbct  # lists the different outputs
perf.srbct$error.rate  ##$overall
##          max.dist
## comp 1 0.43174603
## comp 2 0.21904762
## comp 3 0.00952381
##
## $BER ## max.dist ## comp 1 0.521431159 ## comp 2 0.253768116 ## comp 3 0.008559783  plot(perf.srbct, col = color.mixo(5))  We can also examine the stability of the variables selected across the different cross-validation folds. Each variable's stability that is selected across the CV runs is represented with a vertical bar. We often observe a decrease in stability when more components are added in the model. par(mfrow=c(1,3)) plot(perf.srbct$features$stable[[1]], type = 'h', ylab = 'Stability', xlab = 'Features', main = 'Comp 1', las =2) plot(perf.srbct$features$stable[[2]], type = 'h', ylab = 'Stability', xlab = 'Features', main = 'Comp 2', las =2) plot(perf.srbct$features$stable[[3]], type = 'h', ylab = 'Stability', xlab = 'Features', main = 'Comp 3', las =2)  par(mfrow=c(1,1))  The selectVar function outputs the selected variables along with their loading weight value, from the most important to the least important variable. The code below also extracts the stability of the selected variables in order to gauge our confidence in the signature. For example on component 1, we observe a low stability of the selected variables: # here we match the selected variables to the stable features ind.match = match(selectVar(splsda.srbct, comp = 1)$name,
names(perf.srbct$features$stable[[1]]))
#extract the frequency of selection of those selected variables
Freq = as.numeric(perf.srbct$features$stable[[1]][ind.match])

data.frame(selectVar(splsda.srbct, comp = 1)\$value, Freq)

##        value.var Freq
## g123  0.64922048 0.48
## g846  0.44828969 0.46
## g1606 0.30641602 0.44
## g335  0.30031495 0.44
## g836  0.26392532 0.48
## g783  0.22022625 0.32
## g758  0.22019689 0.40
## g1386 0.16248889 0.36
## g585  0.02058632 0.16


Other graphical outputs

plotLoadings(splsda.srbct, comp = 1, title = 'Loadings on comp 1',
contrib = 'max', method = 'mean')


plotLoadings(splsda.srbct, comp = 2, title = 'Loadings on comp 2',
contrib = 'max', method = 'mean')


plotLoadings(splsda.srbct, comp = 3, title = 'Loadings on comp 3',
contrib = 'max', method = 'mean')


The color assigned can be changed to indicate the class with the minimal mean value:

plotLoadings(splsda.srbct, comp = 1, title = 'Loadings on comp 1',
contrib = 'min', method = 'mean')


A Clustered Image Map including the final gene signature is plotted (default values to Euclidian distance and Complete linkage). The argument comp can be also be specified to highlight only the variables selected on specific components.

cim(splsda.srbct)


cim(splsda.srbct, comp=1, title ="Component 1")


The arrow plot displays all samples spanned by the sPLS-DA components, with arrows pointing towards the location of the $Y$ category (in the Y-space).

plotArrow(splsda.srbct, legend=T)


Other useful variable outputs include plotVar to display the selected variables on a correlation circle plot (see also [4]).

Other outputs

If an external test set is available the predict function outputs the sPLS-DA predicted class membership for each test sample, see also an example from the Supplemental material regarding the prediction distances [3].

Session information of this Sweave code

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] mixOmics_6.3.0     ggplot2_2.2.1.9000 lattice_0.20-35
## [4] MASS_7.3-47        knitr_1.17         RWordPress_0.2-3
##
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.0   purrr_0.2.3        reshape2_1.4.2
##  [4] colorspace_1.3-2   htmltools_0.3.6    yaml_2.1.14
##  [7] XML_3.98-1.9       rlang_0.1.2        glue_1.1.1
## [10] RColorBrewer_1.1-2 bindrcpp_0.2       matrixStats_0.52.2
## [13] plyr_1.8.4         bindr_0.1          stringr_1.2.0
## [16] munsell_0.4.3      gtable_0.2.0       htmlwidgets_0.9
## [19] evaluate_0.10.1    labeling_0.3       httpuv_1.3.5
## [22] parallel_3.3.2     markdown_0.8       XMLRPC_0.3-0
## [25] highr_0.6          rARPACK_0.11-0     Rcpp_0.12.13
## [28] xtable_1.8-2       corpcor_1.6.9      scales_0.5.0.9000
## [31] backports_1.1.0    jsonlite_1.5       mime_0.5
## [34] RSpectra_0.12-0    gridExtra_2.3      ellipse_0.3-8
## [37] digest_0.6.12      stringi_1.1.5      dplyr_0.7.4
## [40] shiny_1.0.5        grid_3.3.2         rprojroot_1.2
## [43] tools_3.3.2        bitops_1.0-6       magrittr_1.5
## [46] rgl_0.98.1         lazyeval_0.2.0     RCurl_1.95-4.8
## [49] tibble_1.3.4       tidyr_0.7.1        pkgconfig_2.0.1
## [52] Matrix_1.2-11      assertthat_0.2.0   rmarkdown_1.6
## [55] R6_2.2.2           igraph_1.1.2