SRBCT example

Single omics supervised multivariate analyses

Here, we illustrate PCA and sPLS-DA using the SRBCT data set, see ?srbct.

Load the latest version of mixOmics (check the version!).
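Assuming mixOmics is already installed (e.g. from Bioconductor), a minimal sketch to load the package, check its version and attach the data is:

```r
library(mixOmics)           # load the package
packageVersion("mixOmics")  # check the installed version
data(srbct)                 # attach the SRBCT study data
```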



The R script is available at this link (R script and RData)


The data are directly available in a processed and normalised format from the package. The Small Round Blue Cell Tumors (SRBCT) dataset from [1] includes the expression levels of 2,308 genes measured on 63 samples. The samples are classified into four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).

The srbct dataset contains the following:

$gene: a data frame with 63 rows and 2,308 columns containing the expression levels of 2,308 genes measured on 63 subjects.

$class: a factor indicating the tumour class of each individual (4 classes in total).

$gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.

X = srbct$gene  # the gene expression data
dim(X)          # check the dimensions of X
## [1]   63 2308
summary(srbct$class)  # number of samples per class
## EWS  BL  NB RMS 
##  23   8  12  20

Principal Component Analysis

A preliminary PCA analysis on the gene expression data allows a first exploration of the major sources of variation in the data. Recall that PCA is an unsupervised analysis where no information about the tumour classes is provided to the method. In order to assess the amount of variation explained, we set ncomp to a rather large number. In PCA, centering is a recommended transformation in most situations [2]: it gives all genes a mean of zero, allowing us to focus on the differences between samples. Scaling aims to give similar weights to all genes in the analysis, since genes with high variance would otherwise be considered influential in PCA without necessarily being of biological relevance.

pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE)
# pca.srbct  # outputs the explained variance per component
plot(pca.srbct)  # screeplot of the eigenvalues (explained variance per component)

(Figure: scree plot of the explained variance per principal component)

The barplot above shows that two components are sufficient to explain most of the variance (or information) in the data.

In the following sample plot, samples are represented on the first two principal components and coloured according to the tumour type. Here we observe that the major source of variation may not be explained by tumour types. Note that since PCA is unsupervised, we only take the sample type information into account after the PCA, for visualisation purposes.

plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE, 
          legend = TRUE, title = 'PCA on SRBCT')

(Figure: PCA sample plot on components 1-2, coloured by tumour type)

PLS-DA analysis

For a discriminant analysis, we set the factor Y that indicates the class membership of each sample. Inside the PLS-DA procedure, the Y factor is internally transformed into a dummy matrix.

Y = srbct$class 
summary(Y)        # outcome categories
## EWS  BL  NB RMS 
##  23   8  12  20
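As an illustrative sketch, the dummy (indicator) matrix that PLS-DA constructs internally can be inspected with the unmap function from mixOmics:

```r
# each row contains a single 1 in the column corresponding to the sample's class
Y.dummy <- unmap(Y)
head(Y.dummy)
```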

A PLS-DA model is fitted with ten components to evaluate the performance and the number of components necessary for the final model (see below).

Sample plots
The samples are then projected into the subspace spanned by the first two components.

srbct.plsda <- plsda(X, Y, ncomp = 10)  # set ncomp to 10 for performance assessment later
plotIndiv(srbct.plsda , comp = 1:2,
          group = srbct$class, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE, title = 'PLSDA on SRBCT')

(Figure: PLS-DA sample plot on components 1-2, with confidence ellipses)

From the sample plot, we observe a clear separation of the four tumour classes, in contrast with the unsupervised PCA sample plot. Confidence ellipses are plotted for each class to highlight the strength of the discrimination (confidence level set to 95% by default, see argument ellipse.level).

As detailed in our main article and supplemental information [3] the prediction area can be visualised by calculating a background surface first, before overlaying the sample plot. See ?background.predict for more details. The prediction distance needs to be specified:

# with background
background = background.predict(srbct.plsda, comp.predicted=2, dist = "max.dist") 
#optional: xlim = c(-40,40), ylim = c(-30,30))

plotIndiv(srbct.plsda, comp = 1:2,
          group = srbct$class, ind.names = FALSE, title = "Maximum distance",
          legend = TRUE,  background = background)

(Figure: PLS-DA sample plot overlaid on the maximum-distance prediction background)


The classification performance of the PLS-DA model is assessed with the perf function using 5-fold cross-validation repeated 10 times. Repetition is necessary to ensure a good estimate of the classification error rate, as the CV folds are determined at random. From the performance results we can decide on the number of components for the final model. The maximal number of components assessed corresponds to the number of components of the srbct.plsda model run above.

# takes a couple of minutes to run
set.seed(2543) # for reproducibility; only relevant when the 'cpus' argument is not used
perf.plsda.srbct <- perf(srbct.plsda, validation = "Mfold", folds = 5, 
                  progressBar = FALSE, auc = TRUE, nrepeat = 10) 
# perf.plsda.srbct$error.rate  # error rates
plot(perf.plsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")

(Figure: classification error rates per component, with standard deviation bars)

From the performance plot, we observe that the overall error rate and the Balanced Error Rate (BER) are similar, and sharply decrease from 1 to 3 components. The error rates stabilise after 6 components. The perf function outputs the optimal number of components in perf.plsda.srbct$choice.ncomp based on t-tests that test for a significant difference in the mean error rate between components. The result is output for each performance measure and distance, when applicable. The performance displayed suggests that ncomp = 6 with BER and the maximum distance is sufficient to achieve good performance (error rate of 0.06).
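As a sketch, the component choice suggested by these t-tests can be extracted directly from the perf output:

```r
# optimal ncomp for each error measure (overall, BER) and prediction distance
perf.plsda.srbct$choice.ncomp
```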

An AUC plot can also be obtained using the function auroc, where the AUC is calculated from training cross-validation sets and averaged (see the perf function outputs for each component, perf.plsda.srbct$auc and perf.plsda.srbct$auc.all, for one vs. one class or one vs. all classes). Note however that the ROC and AUC criteria may not be particularly insightful, or may not agree with the PLS-DA performance, as the prediction threshold in PLS-DA is based on the specified distance, as described in the main manuscript [3].

auc.plsda = auroc(srbct.plsda, roc.comp = 6)

(Figure: AUROC plot for the PLS-DA model, component 6)

sPLS-DA analysis
The PLS-DA model is built on all genes in X, many of which may be uninformative to characterise the different classes. The sPLS-DA analysis aims to identify a small subset of genes that best discriminate the classes.

Tuning sPLS-DA

We first estimate the classification performance (error rate) with respect to the number of selected variables in the model with the function tune.splsda. The tuning is performed one component at a time and we set a maximum of ncomp = 6 as suggested by the PLS-DA performance assessment. We choose 5-fold cross-validation (folds = 5) repeated 10 times, and specify a prediction distance (maximum distance) to predict class membership across all CV runs.

The code below may take a few minutes to run on a mid-range laptop over a grid of keepX values that indicate the number of variables to select on each component. Some warnings may appear for some of the runs, as we set a stringent tolerance threshold in the algorithm. The argument cpus can be used to run the code in parallel and speed up computation.

Here we have saved the results into an RData object that is available for download. The tuning takes about 30 minutes to run.

# grid of possible keepX values that will be tested for each component
list.keepX <- c(1:10,  seq(20, 300, 10))

tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 6, validation = 'Mfold', folds = 5, 
                           progressBar = TRUE, dist = 'max.dist', measure = "BER",
                          test.keepX = list.keepX, nrepeat = 10, cpus = 2)
error <- tune.splsda.srbct$error.rate  # error rate per component for the keepX grid

The optimal number of components to choose is returned in $choice.ncomp$ncomp:

ncomp <- tune.splsda.srbct$choice.ncomp$ncomp # optimal number of components based on t-tests
## [1] 3

The number of features to select on each component is output in $choice.keepX:

select.keepX <- tune.splsda.srbct$choice.keepX[1:ncomp]  # optimal number of variables to select
## comp1 comp2 comp3 
##     9   280    30

The classification error rates for each component, conditional on the components selected before it, are represented below for all components specified in the tune function.

plot(tune.splsda.srbct, col = color.jet(6))

(Figure: tuning results, classification error rate for each keepX value per component)

The classification error rate decreases when more components are included in sPLS-DA (the lower the error rate, the better the prediction accuracy). The optimal number of variables to select that led to the best performance on each component is indicated with a diamond. Three components are sufficient for our final sPLS-DA model to reach optimal performance.

Final model and sample representation
Our final model includes 3 components, with 9, 280 and 30 variables selected on components 1 to 3 respectively.

splsda.srbct <- splsda(X, Y, ncomp = ncomp, keepX = select.keepX) 
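The identity of the genes selected on each component can then be extracted with the selectVar function; for example, for the first component:

```r
# names and loading weights of the 9 genes selected on component 1
selectVar(splsda.srbct, comp = 1)$value
```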

The sample plots on the first three components (see below) show that the BL tumours are well separated on the first component, while the second component discriminates EWS from NB and RMS.

plotIndiv(splsda.srbct, comp = c(1,2),
          group = srbct$class, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE,
          title = 'sPLS-DA on SRBCT, comp 1 & 2')

(Figure: sPLS-DA sample plot on components 1-2, with confidence ellipses)

The addition of the third component further discriminates NB from RMS:

plotIndiv(splsda.srbct, comp = c(1,3),
          group = srbct$class, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE,
          title = 'sPLS-DA on SRBCT, comp 1 & 3')

(Figure: sPLS-DA sample plot on components 1 and 3, with confidence ellipses)

An AUC plot can also be obtained using the function auroc, as for the PLS-DA analysis.
The first AUROC includes 2 components only:

auc.splsda = auroc(splsda.srbct, roc.comp = 2)