Case Study of sPLS-DA with SRBCT dataset
Partial Least Squares Discriminant Analysis (PLS-DA) is a linear, multivariate model which uses the PLS algorithm to allow classification of categorically labelled data. PLS-DA seeks for components that best separate the sample groups, whilst the sparse version also selects variables that best discriminate between groups.
For background information on the (s)PLS-DA method, refer to the PLS-DA Methods Page.
The R script used for all the analysis in this case study is available here.
Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.
library(mixOmics) # import the mixOmics library set.seed(5249) # for reproducibility, remove for normal use
The Small Round Blue Cell Tumours (SRBCT) dataset was generating by studying the expression levels of a set of genes to test for markers of certain tumour types (Khan et al, 2001).
mixOmics SRBCT dataset is accessed via
srbct and contains the following:
srbct$gene(continuous matrix): 63 rows and 2308 columns. The expression levels of the 2308 genes across the 63 tested subjects.
srbct$class(categorical vector): class vector of length 63. Contains the tumour class of each individual. There are four classes which include Burkitt Lymphoma (BL), Ewing Sarcoma (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS).
srbct$gene.name(string vectors): two lists of length 2308. There is the
Image.IDcomponent containing an integer ID for each gene. The
Gene.Descriptioncomponent briefly describes the nature of each gene.
To confirm the correct dataframe was extracted, the dimensions are checked. The distribution of class labels is also examined. It can be seen that these class labels are not balanced.
data(srbct) # extract the small round bull cell tumour data X <- srbct$gene # use the gene expression data as the X matrix Y <- srbct$class # use the class data as the Y matrix dim(X) # check the dimensions of the X dataframe
##  63 2308
summary(Y) # check the distribution of class labels
## EWS BL NB RMS ## 23 8 12 20
Preliminary Analysis with PCA
As in most cases when developing models, exploring the data to determine the major sources of variation is a good first step. PCA will be used for this. As described in the PCA Methods Page, centering and scaling is recommended to homogenize the variance across the genes.
ncomp is set to an arbitrarily high number to understand the captured variance across cotheremponents.
# run pca method on data pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE) plot(pca.srbct) # barplot of the eigenvalues (explained variance per component)
FIGURE 1: Barplot of the variance each principal component explains of the SRBCT gene expression data.
Two components would be sufficient to explain a moderate proportion of the data's variance according to Figure 1. Next, the data is projected onto these two components to attempt to observe sources of variation.
plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE, # plot the samples projected legend = TRUE, title = 'PCA on SRBCT, comp 1 - 2') # onto the PCA subspace
FIGURE 2: Preliminary (unsupervised) analysis with PCA on the SRBCT gene expression data
It seems that different tumour types do not separate or cluster across the two Principal components of the data, as seen in Figure 2. There are clusters, but these are not explained by the class variable. It can be inferred then that the major source of variation is not attributed to tumour type. Note that Figure 2 has each sample coloured by the class. This is only done for visualisation after the PCA as it is an unsupervised approach.
Initial sPLS-DA model
srbct.splsda <- splsda(X, Y, ncomp = 10) # set ncomp to 10 for performance assessment later
A PLS-DA model is fitted with ten components to evaluate the performance and the number of components necessary for the final model. A sample plot, including confidence ellipses, is shown in Figure 3(a). This plot shows much better clustering of samples according to the tumour type when compared to the PCA output.
background.predict() function can also be utilised to depict the separation of class labels as seen in Figure 3(b). This plot provides intuition on how novel samples would be classified according to the model generated by sPLS-DA.
# plot the samples projected onto the first two components of the PLS-DA subspace plotIndiv(srbct.splsda , comp = 1:2, group = srbct$class, ind.names = FALSE, # colour points by class ellipse = TRUE, # include 95% confidence ellipse for each class legend = TRUE, title = '(a) PLSDA with confidence ellipses') # use the max.dist measure to form decision boundaries between classes based on PLS-DA data background = background.predict(srbct.splsda, comp.predicted=2, dist = "max.dist") # plot the samples projected onto the first two components of the PLS-DA subspace plotIndiv(srbct.splsda, comp = 1:2, group = srbct$class, ind.names = FALSE, # colour points by class background = background, # include prediction background for each class legend = TRUE, title = " (b) PLSDA with prediction background")
FIGURE 3: Sample plots of the SRBCT gene expression data after a basic PLS-DA model was operated on this data. (a) depicts the samples with the confidence ellipses of different class labels while (b) depicts the prediction background generated by these samples. Both plots use the first two components as axes.
Selecting the number of components
The ncomp Parameter
The number of components to use is a crucial decision and is dictated by the performance of the PLS-DA model – i.e. its ability to correctly classify novel samples. The
perf() function is used for this exactly. This is done with repeated cross-validation. Based on the output of this function, the optimal number of components to use can be identified.
A three-fold, 10 repeat cross-validation procedure is utilised here. Generally, for datasets with numerous samples, at least 10 folds is recommended. 3 or 5 folds is appropriate for smaller datasets and those with minimal samples should use Leave-One-Out (LOO) validation. Consider using 50-100 repeats to reduce the impact of the randomly allocated folds during each repeat.
The overall error rate (OER) and balanced error rate (BER) for the three different distance metrics (explained further below) across the first ten components are depicted in Figure 4.
# undergo performance evaluation in order to tune the number of components to use perf.splsda.srbct <- perf(srbct.splsda, validation = "Mfold", folds = 5, nrepeat = 10, # use repeated cross-validation progressBar = FALSE, auc = TRUE) # include AUC values # plot the outcome of performance evaluation across all ten components plot(perf.splsda.srbct, col = color.mixo(5:7), sd = TRUE, legend.position = "horizontal")
FIGURE 4: Tuning the number of components in PLS-DA on the SRBCT gene expression data. For each component, repeated cross-validation (10 × 3−fold CV) is used to evaluate the PLS-DA classification performance (OER and BER), for each type of prediction distance; max.dist, centroids.dist and mahalanobis.dist.
From this, it seems three components are appropriate as the error for each distance metric decreases by very incremental amounts after this. Components beyond the third are likely to provide neglible returns to the classification accuracy. A more empirical way to select this number is through the
$choice.ncomp component of the
perf() output object. It runs t-tests for a significant different in mean error rate across components. Using the
max.dist metric, this suggests that the optimal number of components is 4. When to use each distance metric is explained further below.
perf.splsda.srbct$choice.ncomp # what is the optimal value of components according to perf()
## max.dist centroids.dist mahalanobis.dist ## overall 4 5 7 ## BER 4 5 7
Selecting the number of variables
The keepX Parameter
In order to determine the number of variables used to construct each latent component, the
tune.splsda() function is utilised. This is performed iteratively, such that components are tuned one at a time. Through this function, the classification error rate can be extracted and averaged across folds and repeats. As mentioned above, an appropriate number of repeats would be around 50-100.
Up to four components were selected as per Figure 4. In this case, five-fold cross validation was repeated ten times using the BER of
max.dist as the performance measure – where minimisation was optimal. The
cpus parameter allows for the use of parallelisation of computation as tuning can take a long time on low-to-mid range processors.
The output of the tuning is shown in Figure 5. The diamond indicates the optimal number of variables to keep for a given component, selected by which keepX value achieves the lowest classification error rate as determined with a one-sided t−test. The error bars indicate the standard deviation across the repeated, cross-validated folds.
# grid of possible keepX values that will be tested for each component list.keepX <- c(1:10, seq(20, 300, 10)) # undergo the tuning process to determine the optimal number of variables tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 4, # calculate for first 4 components validation = 'Mfold', folds = 5, nrepeat = 10, # use repeated cross-validation dist = 'max.dist', # use max.dist measure measure = "BER", # use balanced error rate of dist measure test.keepX = list.keepX, cpus = 2) # allow for paralleliation to decrease runtime plot(tune.splsda.srbct, col = color.jet(4)) # plot output of variable number tuning
FIGURE 5: Tuning keepX for the sPLS-DA performed on the SRBCT gene expression data. Each coloured line represents the balanced error rate (y-axis) per component across all tested keepX values (x-axis) with the standard deviation based on the repeated cross-validation folds. As sPLS-DA is an iterative algorithm, values represented for a given component (e.g. comp 1 to 2) include the optimal keepX value chosen for the previous component (comp 1).
The above figure also aids in tuning the number of components. While the tuning of component number (through
perf()) yielded an optimal value of 4, conflicting results can be seen after the use of
tune.splsda(), such that the optimal value is claimed to be 3. After the optmisation of the selected features, the fourth component seemingly minimises the BER negligibly.
tune.splsda.srbct$choice.ncomp$ncomp # what is the optimal value of components according to tune.splsda()
##  3
The exact quantity of features to use for each component can also be extracted from this object:
tune.splsda.srbct$choice.keepX # what are the optimal values of variables according to tune.splsda()
## comp1 comp2 comp3 comp4 ## 9 260 30 10
These values are stored to form the final, optimised model.
optimal.ncomp <- tune.splsda.srbct$choice.ncomp$ncomp optimal.keepX <- tune.splsda.srbct$choice.keepX[1:optimal.ncomp]
Using all the tuned parameters from above, the final sPLS-DA model can be formed.
# form final model with optimised values for component and variable count final.splsda <- splsda(X, Y, ncomp = optimal.ncomp, keepX = optimal.keepX)
Once again, sample plots will be used to show the distribution of the data in the latent space. The plots seen in Figure 6 contrast Figure 3 starkly. Figure 3 was yielded by an unoptimised, PLS-DA model (no feature or component selection). Figure 6 shows the sample plots for the first and second components (a) and the first and third components (b). The difference between Figure 6 (a) and (b) is indicative of the fact that different genes discriminate the samples differently. Genes which contributed to the third component separated the
NB classes much better than those which contributed to the second. All three components were well suited to separate the
BL class as it does not overlap any other cluster in either plot.
plotIndiv(final.splsda, comp = c(1,2), # plot samples from final model group = srbct$class, ind.names = FALSE, # colour by class label ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse title = ' (a) sPLS-DA on SRBCT, comp 1 & 2') plotIndiv(final.splsda, comp = c(1,3), # plot samples from final model group = srbct$class, ind.names = FALSE, # colour by class label ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse title = '(b) sPLS-DA on SRBCT, comp 1 & 3')
FIGURE 6: Sample plots from sPLS-DA performed on the SRBCT gene expression data including 95% confidence ellipses. Samples are projected into the space spanned by the first three components. (a) Components 1 and 2 and (b) Components 1 and 3. Samples are coloured by their tumour subtypes.
A Cluster Image Map (CIM) is shown in Figure 7. It depicts the expression levels of each gene (selected for component construction) for every sample. Euclidean distance with a complete agglomeration method were used to yield this CIM. It can be seen that certain sets of genes had homogeneous expression for different classes. For example, nearly half of the genes had high expression with the
EWS (blue) tumour.
# set the styling of the legend to be homogeneous with previous plots legend=list(legend = levels(Y), # set of classes col = unique(color.mixo(Y)), # set of colours title = "Tumour Type", # legend title cex = 0.7) # legend size # generate the CIM, using the legend and colouring rows by each sample's class cim <- cim(final.splsda, row.sideColors = color.mixo(Y), legend = legend)
FIGURE 7: Clustered Image Map of the genes selected by sPLS-DA on the SRBCT gene expression data across all 3 components. A hierarchical clustering based on the gene expression levels of the selected genes, with samples in rows coloured according to their tumour subtype (using Euclidean distance with Complete agglomeration method).
The stability of a given feature is defined as the proportion of cross validation folds (across repeats) where it was selected for to be used for a given component. Stability values can be extracted via
perf.splsda.srbct$features$stable. These stabilities can be plotted, seen in Figure 8. Those with the highest stability are likely to be much more “important” for a given component. The features used for the first component had consistently lower stability than the other two. This can be explained as there are various combinations of genes that are discriminative on component 1, whereas the number of combinations decreases as component 2 is formed.
# form new perf() object which utilises the final model perf.splsda.srbct <- perf(final.splsda, folds = 5, nrepeat = 10, # use repeated cross-validation validation = "Mfold", dist = "max.dist", # use max.dist measure progressBar = FALSE) # plot the stability of each feature for the first three components, 'h' type refers to histogram par(mfrow=c(1,3)) plot(perf.splsda.srbct$features$stable[], type = 'h', ylab = 'Stability', xlab = 'Features', main = '(a) Comp 1', las =2) plot(perf.splsda.srbct$features$stable[], type = 'h', ylab = 'Stability', xlab = 'Features', main = '(b) Comp 2', las =2) plot(perf.splsda.srbct$features$stable[], type = 'h', ylab = 'Stability', xlab = 'Features', main = '(c) Comp 3', las =2)
FIGURE 8: Stability of variable selection from the sPLS-DA on the SRBCT gene expression data. The barplot represents the frequency of selection across repeated CV folds for each selected gene for component 1 (a), 2 (b) and 3 ©.
Another variable plot to be used is the correlation circle plot. Figure 9 depicts this. By considering both the correlation circle plot and the sample plot (Figure 6(a)), a group of genes with a positive correlation with component 1 (
proteasome etc.) are observed to be associated with the
BL samples. Two groups of genes are either positively or negatively correlated with component 2. These genes are likely to characterise either the
RMS classes, or the
var.name.short <- substr(srbct$gene.name[, 2], 1, 10) # form simplified gene names plotVar(final.splsda, comp = c(1,2), var.names = list(var.name.short), cex = 3) # generate correlation circle plot
FIGURE 9: Correlation circle plot representing the genes selected by sPLS-DA performed on the SRBCT gene expression data. Gene names are truncated to the first 10 characters. Only the genes selected by sPLS-DA are shown in components 1 and 2.
When undergoing prediction, the (s)PLS-DA data must first be segmented into training and testing, such that there are novel samples to evaluate performance on. Otherwise, it runs the risk of “overfitting”, resulting in inflated predictive ability scores.
train <- sample(1:nrow(X), 50) # randomly select 50 samples in training test <- setdiff(1:nrow(X), train) # rest is part of the test set # store matrices into training and test set: X.train <- X[train, ] X.test <- X[test,] Y.train <- Y[train] Y.test <- Y[test]
A model is then trained on the training data. Note that the previously calculated
optimal.keepX values are used here. In real scenarios, the training model should be tuned itself. It is crucial that when tuning the training model, it is done in the absence of the testing data. This also reduces likelihood of overfitting.
# train the model train.splsda.srbct <- splsda(X.train, Y.train, ncomp = optimal.ncomp, keepX = optimal.keepX)
The model is then applied on the test set using a specific distance metric. In this case, the Mahalanobis distance was used (arbitrarily).
# use the model on the Xtest set predict.splsda.srbct <- predict(train.splsda.srbct, X.test, dist = "mahalanobis.dist")
To evaluate the predictive performance, confusion matrices can be used. Directly below is such a matrix for a model using just the first two components. Only one misclassification were made – one sample was claimed to belong to the
NB class when it truthfully belonged to the
# evaluate the prediction accuracy for the first two components predict.comp2 <- predict.splsda.srbct$class$mahalanobis.dist[,2] table(factor(predict.comp2, levels = levels(Y)), Y.test)
## Y.test ## EWS BL NB RMS ## EWS 5 0 0 0 ## BL 0 1 0 0 ## NB 1 0 4 1 ## RMS 0 0 0 1
Below is the equivalent matrix for the model using all three components. It can be seen that the classification accuracy increased.
# evaluate the prediction accuracy for the first three components predict.comp3 <- predict.splsda.srbct$class$mahalanobis.dist[,3] table(factor(predict.comp3, levels = levels(Y)), Y.test)
## Y.test ## EWS BL NB RMS ## EWS 6 0 0 0 ## BL 0 1 0 0 ## NB 0 0 4 0 ## RMS 0 0 0 2
AUC plots can be used for performance evaluation. AUC scores are calculated from training cross-validation sets and averaged in the
perf() function (
perf.plsda.srbct$auc.all for one vs. one class or one vs. all classes respectively).
However, ROC and AUC criteria may not be particularly insightful, or be in agreement with the PLSDA performance, as the prediction threshold in PLS-DA is based on specified distance. AUROC curves use a cutoff that maximises specificity and sensitivity rather than this distance and hence should be used a merely a complementary tool.
AUROC plots for models containing one component and three components can be seen in Figure 10 (a) and (b) respectively. Figure 10(a) suggests that the sPLS-DA model can distinguish
BL subjects from the other groups with a high true positive and low false positive rate, while the model is less well able to distinguish samples from other classes on component 1. The model including all three components (Figure 10(b)) has a perfect classification accuracy. While the model is definitely improved by the addition of two components, it is the small testing set size which allows for this perfect score.
Note that if
print = TRUE (as is by default), numerical output including AUC and a Wilcoxon test p-value for each ‘one vs. other’ class comparisons that are performed per component will be printed.
auc.splsda = auroc(final.splsda, roc.comp = 1, print = FALSE) # AUROC for the first component auc.splsda = auroc(final.splsda, roc.comp = 3, print = FALSE) # AUROC for all three components
FIGURE 10: ROC curve and AUC from sPLS-DA on the SRBCT gene expression data on component 1 (a) and all three components (b) averaged across one-vs.-all comparisons.
More information on Plots
For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages:
- Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C. and Meltzer, P.S., 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature medicine, 7(6), pp.673-679.