Here, we illustrate PCA and sPLS-DA using the SRBCT data set, see ?srbct.
Load the latest version of mixOmics (check the version!).
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.2.0
##
## Visit http://www.mixOmics.org for more details about our methods.
## Any bug reports or comments? Notify us at mixomics at math.univ-toulouse.fr or https://bitbucket.org/klecao/package-mixomics/issues
##
## Thank you for using mixOmics!
The R script is provided with this tutorial, along with the RData to load some already-run results to save some time.
The data are directly available in a processed and normalised format from the package. The Small Round Blue Cell Tumors (SRBCT) dataset from [1] includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).
The srbct dataset contains the following:
$gene: a data frame with 63 rows and 2,308 columns, containing the expression levels of 2,308 genes for the 63 subjects.
$class: a factor containing the tumour class of each individual (4 classes in total).
$gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.
data(srbct)
#the gene expression data
X = srbct$gene
dim(X)
## [1] 63 2308
# the tumour subtypes
summary(srbct$class)
## EWS BL NB RMS
## 23 8 12 20
A preliminary PCA on the gene expression data allows a first exploration of the major sources of variation in the data. Recall that PCA is an unsupervised analysis: no information about the tumour classes is provided to the method. To understand the amount of variation explained, we set ncomp to a rather large number. In PCA, centering is recommended in most situations [2]; it gives all genes a zero mean, allowing us to focus on the differences between samples. Scaling gives similar weights to all genes in the analysis, since genes with high variance would otherwise be considered influential in PCA without necessarily being of biological relevance.
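The effect of these two transformations can be checked with base R alone. A minimal sketch on a toy matrix (the object names and dimensions are made up for the example; prcomp applies the same transformations internally with the analogous options):

```r
# Toy illustration of centering and scaling before PCA (base R only;
# X.toy is a made-up matrix of 5 'genes' with increasing variance).
set.seed(1)
X.toy <- matrix(rnorm(20 * 5, mean = 3, sd = rep(1:5, each = 20)),
                nrow = 20)

X.cs <- scale(X.toy, center = TRUE, scale = TRUE)
all(abs(colMeans(X.cs)) < 1e-12)          # centering: zero mean per column
all(abs(apply(X.cs, 2, sd) - 1) < 1e-12)  # scaling: unit variance per column

# prcomp() performs the equivalent transformation with these options:
pca.toy <- prcomp(X.toy, center = TRUE, scale. = TRUE)
```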
pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE)
#pca.srbct #outputs the explained variance per component
plot(pca.srbct) # screeplot of the eigenvalues (explained variance per component)
The barplot above shows that two components are sufficient to explain most of the variance (or information) in the data. We represent (project) the samples on the first two principal components:
plotIndiv(pca.srbct, ind.names = FALSE, pch = 16, title = 'PCA on SRBCT')
We may also wonder if the sample clusters we visualise are related to the tumour subtypes.
plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE,
legend = TRUE, title = 'PCA on SRBCT')
Here we observe that the major source of variation may not be explained by tumour subtype but by other, unknown drivers. This could be pursued further with sparse PCA, which identifies the key variables driving the variation (see ?spca, ?selectVar and ?plotVar).
Note that since PCA is unsupervised, we only take into account the sample type information after the PCA, for visualisation purposes.
For a discriminant analysis, we set the factor Y that indicates the class membership of each sample.
Y = srbct$class
summary(Y) #outcome categories
## EWS BL NB RMS
## 23 8 12 20
A PLS-DA model is first fitted with five components to evaluate the number of components necessary for the final model (see below).
Let’s have a look at the samples projected into the subspace spanned by the first two PLS-DA components.
srbct.plsda <- plsda(X, Y, ncomp = 5) # set ncomp to 5 for performance assessment later
plotIndiv(srbct.plsda , comp = 1:2,
group = srbct$class, ind.names = FALSE,
ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT')
From the sample plot, we observe a much clearer separation of the four tumour classes than in the unsupervised PCA sample plot. Confidence ellipses are plotted for each class to highlight the strength of the discrimination (confidence level set to 95% by default; see the argument ellipse.level).
PLS-DA is a prediction model, meaning that we could classify the tumour subtype of new individuals based on the model. A graphical output to visualise where those new individuals would ‘fall’ is available. As detailed in our main article and supplemental information [3], the prediction area can be visualised by calculating a background surface first, before overlaying the sample plot. See ?background.predict for more details. The prediction distance needs to be specified:
# with background
# ~ takes 30 sec to run the simulations
background = background.predict(srbct.plsda, comp.predicted = 2, dist = "max.dist")
plotIndiv(srbct.plsda, comp = 1:2,
group = srbct$class, ind.names = FALSE, title = "Maximum distance",
legend = TRUE, background = background)
The classification performance of the PLS-DA model is assessed with the perf function using 5-fold cross-validation repeated 10 times. Repeats are necessary to obtain a good estimate of the classification error rate (as the CV folds are determined at random). From the performance results we can decide on the number of components for the final PLS-DA model. The maximal number of components assessed corresponds to the number of components of the srbct.plsda model run above.
# takes ~ 1 minute to run
set.seed(2543) # for reproducibility, only when the `cpus' argument is not used
perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 5,
progressBar = FALSE, auc = TRUE, nrepeat = 10)
# perf.plsda.srbct$error.rate # error rates
plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
From the performance plot, we observe that the overall error rate and the Balanced Error Rate (BER) are similar, and sharply decrease from 1 to 3 components. The error rates then stabilise. The perf function outputs the optimal number of components in perf.plsda.srbct$choice.ncomp based on t-tests that test for a significant difference in the mean error rate between components. The result is output for each performance measure and distance, when applicable. The performance displayed suggests that ncomp = 4 with BER and the maximum distance is sufficient to achieve good performance (0.06 error rate).
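For reference, the BER reported here is the average of the per-class error rates, so that small classes (the 8 BL samples) weigh as much as large ones (the 23 EWS samples). A minimal base-R sketch of the idea, on made-up labels:

```r
# Minimal sketch of the Balanced Error Rate (BER): the mean of the
# per-class misclassification rates (toy labels, not the SRBCT data).
ber <- function(truth, predicted) {
  per.class <- sapply(levels(truth), function(cl) {
    idx <- truth == cl
    mean(predicted[idx] != cl)  # error rate within class cl
  })
  mean(per.class)
}

truth <- factor(c("EWS", "EWS", "EWS", "BL"))
pred  <- factor(c("EWS", "EWS", "BL", "BL"), levels = levels(truth))
ber(truth, pred)  # (1/3 + 0)/2, whereas the overall error rate is 1/4
```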
An AUC plot can also be obtained with the function auroc, where the AUC is calculated from the training cross-validation sets and averaged (see the perf function outputs perf.plsda.srbct$auc and perf.plsda.srbct$auc.all for one vs. one class or one vs. all classes, per component). Note however that the ROC and AUC criteria may not be particularly insightful, or may not agree with the PLS-DA performance, as the prediction threshold in PLS-DA is based on a specified distance, as described in the main manuscript [3].
auc.plsda = auroc(srbct.plsda, roc.comp = 5) # last component of the 5-component model
The PLS-DA model is built on all genes in X, many of which may be uninformative to characterise the different classes. The sPLS-DA analysis aims to identify a small subset of genes that best discriminate the classes.
Refer to http://mixomics.org/case-studies/srbct-example/ on how to choose the number of variables to select in sPLS-DA. Here we just load the results.
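For completeness, a hedged sketch of the tuning call that typically produces such results follows; the keepX grid and ncomp value below are indicative assumptions, not the exact settings used to generate the loaded results (refer to the web page above for those):

```r
# Indicative tuning step (commented out: takes several minutes to run).
# The grid of keepX values and ncomp are assumptions for illustration.
# list.keepX <- c(1:10, seq(20, 300, 10))
# tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 4, validation = 'Mfold',
#                                  folds = 5, progressBar = FALSE,
#                                  dist = 'max.dist', test.keepX = list.keepX,
#                                  nrepeat = 10)
```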
load('RData/result-SRBCT-sPLSDA.Rdata')
The number of optimal components to choose from the tuning step:
ncomp <- tune.splsda.srbct$choice.ncomp$ncomp # optimal number of components based on t-tests
ncomp
## [1] 3
The number of features to select on each component from the tuning step:
select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp] # optimal number of variables to select
select.keepX
## comp1 comp2 comp3
## 9 280 30
Our final model includes 3 components, with 9, 280 and 30 genes selected on components 1 to 3, respectively.
splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX)
The sample plots on the first three components (see below) show that the BL tumours are well separated on the first component, while the second component discriminates EWS from NB and RMS.
plotIndiv(splsda.srbct, comp = c(1,2),
group = srbct$class, ind.names = FALSE,
ellipse = TRUE, legend = TRUE,
title = 'sPLS-DA on SRBCT, comp 1 & 2')
The addition of the third component further discriminates NB from RMS:
plotIndiv(splsda.srbct, comp = c(1,3),
group = srbct$class, ind.names = FALSE,
ellipse = TRUE, legend = TRUE,
title = 'sPLS-DA on SRBCT, comp 1 & 3')
An AUC plot can also be obtained (see the example on our website).
The plotLoadings function displays the loading weights, where colours indicate the class for which each selected variable has a maximal mean value. The most important selected genes are located at the bottom of the graph.
plotLoadings(splsda.srbct, comp = 1, title = 'Loadings on comp 1',
contrib = 'max', method = 'mean')
plotLoadings(splsda.srbct, comp = 2, title = 'Loadings on comp 2',
contrib = 'max', method = 'mean')
plotLoadings(splsda.srbct, comp = 3, title = 'Loadings on comp 3',
contrib = 'max', method = 'mean')
A Clustered Image Map including the final gene signature is plotted (defaults: Euclidean distance and complete linkage). The argument comp can also be specified to highlight only the variables selected on specific components.
# if RStudio gives you trouble with window size, uncomment using the save argument
# cim(splsda.srbct, save = 'jpeg', name.save = 'mycim')
# to visualise genes selected on comp 1 only
# cim(splsda.srbct, comp=1, title ="Component 1", save = 'jpeg', name.save = 'mycim-comp1')
The arrow plot displays all samples spanned by the sPLS-DA components, with arrows pointing towards the location of their Y category in the Y-space.
plotArrow(splsda.srbct, legend = TRUE)
Other useful variable outputs include plotVar to display the selected variables on a correlation circle plot (see also [4]).
If an external test set is available, the predict function outputs the sPLS-DA predicted class membership for each test sample; see also the example from the Supplemental material regarding the prediction distances [3].
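A minimal sketch of that prediction step, here simulating an external test set by holding out a few samples of X (the held-out indices are arbitrary, chosen only for illustration):

```r
# Hedged sketch (commented out): refit on a training subset, then predict
# the class of held-out samples with the same distance as used above.
# test.idx  <- c(1, 10, 30, 50)                       # arbitrary hold-out
# splsda.tr <- splsda(X[-test.idx, ], Y[-test.idx],
#                     ncomp = ncomp, keepX = select.keepX)
# test.pred <- predict(splsda.tr, newdata = X[test.idx, ], dist = "max.dist")
# test.pred$class$max.dist[, ncomp]  # predicted tumour subtype per sample
```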
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mixOmics_6.2.0 ggplot2_2.2.1 lattice_0.20-34 MASS_7.3-45
## [5] knitr_1.16
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.11 RColorBrewer_1.1-2 plyr_1.8.4
## [4] tools_3.3.2 digest_0.6.12 jsonlite_1.4
## [7] evaluate_0.10 tibble_1.3.1 gtable_0.2.0
## [10] rlang_0.1.1 igraph_1.0.1 shiny_1.0.3
## [13] DBI_0.5-1 yaml_2.1.14 parallel_3.3.2
## [16] stringr_1.2.0 dplyr_0.5.0 htmlwidgets_0.8
## [19] rprojroot_1.1 grid_3.3.2 ellipse_0.3-8
## [22] R6_2.2.1 rgl_0.97.0 rmarkdown_1.2
## [25] reshape2_1.4.2 tidyr_0.6.3 corpcor_1.6.8
## [28] magrittr_1.5 backports_1.0.4 scales_0.4.1
## [31] htmltools_0.3.6 assertthat_0.2.0 mime_0.5
## [34] colorspace_1.3-2 xtable_1.8-2 httpuv_1.3.3
## [37] labeling_0.3 stringi_1.1.5 lazyeval_0.2.0
## [40] munsell_0.4.3
Svante Wold, Michael Sjöström, and Lennart Eriksson. PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems, 58(2):109–130, 2001.