Case Study of sPLS-DA with SRBCT dataset

Partial Least Squares Discriminant Analysis (PLS-DA) is a linear, multivariate model which uses the PLS algorithm to allow classification of categorically labelled data. PLS-DA seeks for components that best separate the sample groups, whilst the sparse version also selects variables that best discriminate between groups.

To begin

Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.

library(mixOmics) # import the mixOmics library

set.seed(5249) # for reproducibility, remove for normal use

The data

The Small Round Blue Cell Tumours (SRBCT) dataset was generating by studying the expression levels of a set of genes to test for markers of certain tumour types (Khan et al, 2001).

The mixOmics SRBCT dataset is accessed via srbct and contains the following:

  • srbct$gene (continuous matrix): 63 rows and 2308 columns. The expression levels of the 2308 genes across the 63 tested subjects.

  • srbct$class (categorical vector): class vector of length 63. Contains the tumour class of each individual. There are four classes which include Burkitt Lymphoma (BL), Ewing Sarcoma (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS).

  • srbct$gene.name (string vectors): two lists of length 2308. There is the Image.ID component containing an integer ID for each gene. The Gene.Description component briefly describes the nature of each gene.

To confirm the correct dataframe was extracted, the dimensions are checked. The distribution of class labels is also examined. It can be seen that these class labels are not balanced.

data(srbct) # extract the small round bull cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class data as the Y matrix

dim(X) # check the dimensions of the X dataframe
## [1]   63 2308
summary(Y) # check the distribution of class labels
## EWS  BL  NB RMS 
##  23   8  12  20

Initial Analysis

Preliminary Analysis with PCA

As in most cases when developing models, exploring the data to determine the major sources of variation is a good first step. PCA will be used for this. As described in the PCA Methods Page, centering and scaling is recommended to homogenize the variance across the genes. ncomp is set to an arbitrarily high number to understand the captured variance across cotheremponents.

# run pca method on data
pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE) 
plot(pca.srbct)  # barplot of the eigenvalues (explained variance per component)

FIGURE 1: Barplot of the variance each principal component explains of the SRBCT gene expression data.

Two components would be sufficient to explain a moderate proportion of the data’s variance according to Figure 1. Next, the data is projected onto these two components to attempt to observe sources of variation.

plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE, # plot the samples projected
          legend = TRUE, title = 'PCA on SRBCT, comp 1 - 2') # onto the PCA subspace

FIGURE 2: Preliminary (unsupervised) analysis with PCA on the SRBCT gene expression data

It seems that different tumour types do not separate or cluster across the two Principal components of the data, as seen in Figure 2. There are clusters, but these are not explained by the class variable. It can be inferred then that the major source of variation is not attributed to tumour type. Note that Figure 2 has each sample coloured by the class. This is only done for visualisation after the PCA as it is an unsupervised approach.

Initial sPLS-DA model

srbct.splsda <- splsda(X, Y, ncomp = 10)  # set ncomp to 10 for performance assessment later

A PLS-DA model is fitted with ten components to evaluate the performance and the number of components necessary for the final model. A sample plot, including confidence ellipses, is shown in Figure 3(a). This plot shows much better clustering of samples according to the tumour type when compared to the PCA output.

The background.predict() function can also be utilised to depict the separation of class labels as seen in Figure 3(b). This plot provides intuition on how novel samples would be classified according to the model generated by sPLS-DA.

# plot the samples projected onto the first two components of the PLS-DA subspace
plotIndiv(srbct.splsda , comp = 1:2, 
          group = srbct$class, ind.names = FALSE,  # colour points by class
          ellipse = TRUE, # include 95% confidence ellipse for each class
          legend = TRUE, title = '(a) PLSDA with confidence ellipses')

# use the max.dist measure to form decision boundaries between classes based on PLS-DA data
background = background.predict(srbct.splsda, comp.predicted=2, dist = "max.dist")

# plot the samples projected onto the first two components of the PLS-DA subspace
plotIndiv(srbct.splsda, comp = 1:2,
          group = srbct$class, ind.names = FALSE, # colour points by class
          background = background, # include prediction background for each class
          legend = TRUE, title = " (b) PLSDA with prediction background")

FIGURE 3: Sample plots of the SRBCT gene expression data after a basic PLS-DA model was operated on this data. (a) depicts the samples with the confidence ellipses of different class labels while (b) depicts the prediction background generated by these samples. Both plots use the first two components as axes.

Tuning sPLS-DA

Selecting the number of components

The ncomp Parameter

The number of components to use is a crucial decision and is dictated by the performance of the PLS-DA model - i.e. its ability to correctly classify novel samples. The perf() function is used for this exactly. This is done with repeated cross-validation. Based on the output of this function, the optimal number of components to use can be identified.

A three-fold, 10 repeat cross-validation procedure is utilised here. Generally, for datasets with numerous samples, at least 10 folds is recommended. 3 or 5 folds is appropriate for smaller datasets and those with minimal samples should use Leave-One-Out (LOO) validation. Consider using 50-100 repeats to reduce the impact of the randomly allocated folds during each repeat.

The overall error rate (OER) and balanced error rate (BER) for the three different distance metrics (explained further below) across the first ten components are depicted in Figure 4.

# undergo performance evaluation in order to tune the number of components to use
perf.splsda.srbct <- perf(srbct.splsda, validation = "Mfold", 
                          folds = 5, nrepeat = 10, # use repeated cross-validation
                          progressBar = FALSE, auc = TRUE) # include AUC values

# plot the outcome of performance evaluation across all ten components
plot(perf.splsda.srbct, col = color.mixo(5:7), sd = TRUE,
     legend.position = "horizontal")

FIGURE 4: Tuning the number of components in PLS-DA on the SRBCT gene expression data. For each component, repeated cross-validation (10 × 3−fold CV) is used to evaluate the PLS-DA classification performance (OER and BER), for each type of prediction distance; max.dist, centroids.dist and mahalanobis.dist.

From this, it seems three components are appropriate as the error for each distance metric decreases by very incremental amounts after this. Components beyond the third are likely to provide neglible returns to the classification accuracy. A more empirical way to select this number is through the $choice.ncomp component of the perf() output object. It runs t-tests for a significant different in mean error rate across components. Using the max.dist metric, this suggests that the optimal number of components is 4. When to use each distance metric is explained further below.

perf.splsda.srbct$choice.ncomp # what is the optimal value of components according to perf()
##         max.dist centroids.dist mahalanobis.dist
## overall        4              4                7
## BER            4              4                7

Selecting the number of variables

The keepX Parameter

In order to determine the number of variables used to construct each latent component, the tune.splsda() function is utilised. This is performed iteratively, such that components are tuned one at a time. Through this function, the classification error rate can be extracted and averaged across folds and repeats. As mentioned above, an appropriate number of repeats would be around 50-100.

Up to four components were selected as per Figure 4. In this case, five-fold cross validation was repeated ten times using the BER of max.dist as the performance measure - where minimisation was optimal. The cpus parameter allows for the use of parallelisation of computation as tuning can take a long time on low-to-mid range processors.

The output of the tuning is shown in Figure 5. The diamond indicates the optimal number of variables to keep for a given component, selected by which keepX value achieves the lowest classification error rate as determined with a one-sided t−test. The error bars indicate the standard deviation across the repeated, cross-validated folds.

# grid of possible keepX values that will be tested for each component
list.keepX <- c(1:10,  seq(20, 300, 10))

# undergo the tuning process to determine the optimal number of variables
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 4, # calculate for first 4 components
                                 validation = 'Mfold',
                                 folds = 5, nrepeat = 10, # use repeated cross-validation
                                 dist = 'max.dist', # use max.dist measure
                                 measure = "BER", # use balanced error rate of dist measure
                                 test.keepX = list.keepX)

plot(tune.splsda.srbct, col = color.jet(4)) # plot output of variable number tuning

FIGURE 5: Tuning keepX for the sPLS-DA performed on the SRBCT gene expression data. Each coloured line represents the balanced error rate (y-axis) per component across all tested keepX values (x-axis) with the standard deviation based on the repeated cross-validation folds. As sPLS-DA is an iterative algorithm, values represented for a given component (e.g. comp 1 to 2) include the optimal keepX value chosen for the previous component (comp 1).

The above figure also aids in tuning the number of components. While the tuning of component number (through perf()) yielded an optimal value of 4, conflicting results can be seen after the use of tune.splsda(), such that the optimal value is claimed to be 3. After the optmisation of the selected features, the fourth component seemingly minimises the BER negligibly.

tune.splsda.srbct$choice.ncomp$ncomp # what is the optimal value of components according to tune.splsda()
## [1] 3

The exact quantity of features to use for each component can also be extracted from this object:

tune.splsda.srbct$choice.keepX # what are the optimal values of variables according to tune.splsda()
## comp1 comp2 comp3 comp4 
##    10   290   240    80

These values are stored to form the final, optimised model.

optimal.ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
optimal.keepX <- tune.splsda.srbct$choice.keepX[1:optimal.ncomp]

Final Model

Using all the tuned parameters from above, the final sPLS-DA model can be formed.

# form final model with optimised values for component and variable count
final.splsda <- splsda(X, Y, 
                       ncomp = optimal.ncomp, 
                       keepX = optimal.keepX)

Plots

Sample Plots

Once again, sample plots will be used to show the distribution of the data in the latent space. The plots seen in Figure 6 contrast Figure 3 starkly. Figure 3 was yielded by an unoptimised, PLS-DA model (no feature or component selection). Figure 6 shows the sample plots for the first and second components (a) and the first and third components (b). The difference between Figure 6 (a) and (b) is indicative of the fact that different genes discriminate the samples differently. Genes which contributed to the third component separated the RMS and NB classes much better than those which contributed to the second. All three components were well suited to separate the BL class as it does not overlap any other cluster in either plot.

plotIndiv(final.splsda, comp = c(1,2), # plot samples from final model
          group = srbct$class, ind.names = FALSE, # colour by class label
          ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse
          title = ' (a) sPLS-DA on SRBCT, comp 1 & 2')

plotIndiv(final.splsda, comp = c(1,3), # plot samples from final model
          group = srbct$class, ind.names = FALSE,  # colour by class label
          ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse
          title = '(b) sPLS-DA on SRBCT, comp 1 & 3')

FIGURE 6: Sample plots from sPLS-DA performed on the SRBCT gene expression data including 95% confidence ellipses. Samples are projected into the space spanned by the first three components. (a) Components 1 and 2 and (b) Components 1 and 3. Samples are coloured by their tumour subtypes.

A Cluster Image Map (CIM) is shown in Figure 7. It depicts the expression levels of each gene (selected for component construction) for every sample. Euclidean distance with a complete agglomeration method were used to yield this CIM. It can be seen that certain sets of genes had homogeneous expression for different classes. For example, nearly half of the genes had high expression with the EWS (blue) tumour.

# set the styling of the legend to be homogeneous with previous plots
legend=list(legend = levels(Y), # set of classes
            col = unique(color.mixo(Y)), # set of colours
            title = "Tumour Type", # legend title
            cex = 0.7) # legend size

# generate the CIM, using the legend and colouring rows by each sample's class
cim <- cim(final.splsda, row.sideColors = color.mixo(Y), 
           legend = legend)
newplot

FIGURE 7: Clustered Image Map of the genes selected by sPLS-DA on the SRBCT gene expression data across all 3 components. A hierarchical clustering based on the gene expression levels of the selected genes, with samples in rows coloured according to their tumour subtype (using Euclidean distance with Complete agglomeration method).

Variable Plots

The stability of a given feature is defined as the proportion of cross validation folds (across repeats) where it was selected for to be used for a given component. Stability values can be extracted via perf.splsda.srbct$features$stable. These stabilities can be plotted, seen in Figure 8. Those with the highest stability are likely to be much more “important” for a given component. The features used for the first component had consistently lower stability than the other two. This can be explained as there are various combinations of genes that are discriminative on component 1, whereas the number of combinations decreases as component 2 is formed.

# form new perf() object which utilises the final model
perf.splsda.srbct <- perf(final.splsda, 
                          folds = 5, nrepeat = 10, # use repeated cross-validation
                          validation = "Mfold", dist = "max.dist",  # use max.dist measure
                          progressBar = FALSE)

# plot the stability of each feature for the first three components, 'h' type refers to histogram
par(mfrow=c(1,3))
plot(perf.splsda.srbct$features$stable[[1]], type = 'h', 
     ylab = 'Stability', 
     xlab = 'Features', 
     main = '(a) Comp 1', las =2)
plot(perf.splsda.srbct$features$stable[[2]], type = 'h', 
     ylab = 'Stability', 
     xlab = 'Features', 
     main = '(b) Comp 2', las =2)
plot(perf.splsda.srbct$features$stable[[3]], type = 'h', 
     ylab = 'Stability', 
     xlab = 'Features',
     main = '(c) Comp 3', las =2)

FIGURE 8: Stability of variable selection from the sPLS-DA on the SRBCT gene expression data. The barplot represents the frequency of selection across repeated CV folds for each selected gene for component 1 (a), 2 (b) and 3 (c).

Another variable plot to be used is the correlation circle plot. Figure 9 depicts this. By considering both the correlation circle plot and the sample plot (Figure 6(a)), a group of genes with a positive correlation with component 1 (EH domain, proteasome etc.) are observed to be associated with the BL samples. Two groups of genes are either positively or negatively correlated with component 2. These genes are likely to characterise either the NB and RMS classes, or the EWS class.

var.name.short <- substr(srbct$gene.name[, 2], 1, 10) # form simplified gene names

plotVar(final.splsda, comp = c(1,2), var.names = list(var.name.short), cex = 3) # generate correlation circle plot

FIGURE 9: Correlation circle plot representing the genes selected by sPLS-DA performed on the SRBCT gene expression data. Gene names are truncated to the first 10 characters. Only the genes selected by sPLS-DA are shown in components 1 and 2.

Prediction

When undergoing prediction, the (s)PLS-DA data must first be segmented into training and testing, such that there are novel samples to evaluate performance on. Otherwise, it runs the risk of “overfitting”, resulting in inflated predictive ability scores.

train <- sample(1:nrow(X), 50) # randomly select 50 samples in training
test <- setdiff(1:nrow(X), train) # rest is part of the test set

# store matrices into training and test set:
X.train <- X[train, ]
X.test <- X[test,]
Y.train <- Y[train]
Y.test <- Y[test]

A model is then trained on the training data. Note that the previously calculated optimal.keepX values are used here. In real scenarios, the training model should be tuned itself. It is crucial that when tuning the training model, it is done in the absence of the testing data. This also reduces likelihood of overfitting.

# train the model
train.splsda.srbct <- splsda(X.train, Y.train, ncomp = optimal.ncomp, keepX = optimal.keepX)

The model is then applied on the test set using a specific distance metric. In this case, the Mahalanobis distance was used (arbitrarily).

# use the model on the Xtest set
predict.splsda.srbct <- predict(train.splsda.srbct, X.test, 
                                dist = "mahalanobis.dist")

To evaluate the predictive performance, confusion matrices can be used. Directly below is such a matrix for a model using just the first two components. Only one misclassification were made - one sample was claimed to belong to the NB class when it truthfully belonged to the RMS class.

# evaluate the prediction accuracy for the first two components
predict.comp2 <- predict.splsda.srbct$class$mahalanobis.dist[,2]
table(factor(predict.comp2, levels = levels(Y)), Y.test)
##      Y.test
##       EWS BL NB RMS
##   EWS   3  0  0   0
##   BL    0  0  0   0
##   NB    2  0  3   1
##   RMS   0  0  0   4

Below is the equivalent matrix for the model using all three components. It can be seen that the classification accuracy increased.

# evaluate the prediction accuracy for the first three components
predict.comp3 <- predict.splsda.srbct$class$mahalanobis.dist[,3]
table(factor(predict.comp3, levels = levels(Y)), Y.test)
##      Y.test
##       EWS BL NB RMS
##   EWS   3  0  0   0
##   BL    0  0  0   0
##   NB    1  0  3   1
##   RMS   1  0  0   4

Performance Plots

AUC plots can be used for performance evaluation. AUC scores are calculated from training cross-validation sets and averaged in the perf() function (perf.plsda.srbct$auc and perf.plsda.srbct$auc.all for one vs. one class or one vs. all classes respectively).

However, ROC and AUC criteria may not be particularly insightful, or be in agreement with the PLSDA performance, as the prediction threshold in PLS-DA is based on specified distance. AUROC curves use a cutoff that maximises specificity and sensitivity rather than this distance and hence should be used a merely a complementary tool.

AUROC plots for models containing one component and three components can be seen in Figure 10 (a) and (b) respectively. Figure 10(a) suggests that the sPLS-DA model can distinguish BL subjects from the other groups with a high true positive and low false positive rate, while the model is less well able to distinguish samples from other classes on component 1. The model including all three components (Figure 10(b)) has a perfect classification accuracy. While the model is definitely improved by the addition of two components, it is the small testing set size which allows for this perfect score.

Note that if print = TRUE (as is by default), numerical output including AUC and a Wilcoxon test p-value for each ‘one vs. other’ class comparisons that are performed per component will be printed.

auc.splsda = auroc(final.splsda, roc.comp = 1, print = FALSE) # AUROC for the first component

auc.splsda = auroc(final.splsda, roc.comp = 3, print = FALSE) # AUROC for all three components

FIGURE 10: ROC curve and AUC from sPLS-DA on the SRBCT gene expression data on component 1 (a) and all three components (b) averaged across one-vs.-all comparisons.