Case Study of sPLSDA with SRBCT dataset
Partial Least Squares Discriminant Analysis (PLSDA) is a linear, multivariate model which uses the PLS algorithm to allow classification of categorically labelled data. PLSDA seeks for components that best separate the sample groups, whilst the sparse version also selects variables that best discriminate between groups.
For background information on the (s)PLSDA method, refer to the PLSDA Methods Page.
Rscript
The R script used for all the analysis in this case study is available here.
To begin
Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.
library(mixOmics) # import the mixOmics library
set.seed(5249) # for reproducibility, remove for normal use
The data
The Small Round Blue Cell Tumours (SRBCT) dataset was generating by studying the expression levels of a set of genes to test for markers of certain tumour types (Khan et al, 2001).
The mixOmics
SRBCT dataset is accessed via srbct
and contains the following:

srbct$gene
(continuous matrix): 63 rows and 2308 columns. The expression levels of the 2308 genes across the 63 tested subjects. 
srbct$class
(categorical vector): class vector of length 63. Contains the tumour class of each individual. There are four classes which include Burkitt Lymphoma (BL), Ewing Sarcoma (EWS), neuroblastoma (NB) and rhabdomyosarcoma (RMS). 
srbct$gene.name
(string vectors): two lists of length 2308. There is theImage.ID
component containing an integer ID for each gene. TheGene.Description
component briefly describes the nature of each gene.
To confirm the correct dataframe was extracted, the dimensions are checked. The distribution of class labels is also examined. It can be seen that these class labels are not balanced.
data(srbct) # extract the small round bull cell tumour data
X < srbct$gene # use the gene expression data as the X matrix
Y < srbct$class # use the class data as the Y matrix
dim(X) # check the dimensions of the X dataframe
## [1] 63 2308
summary(Y) # check the distribution of class labels
## EWS BL NB RMS ## 23 8 12 20
Initial Analysis
Preliminary Analysis with PCA
As in most cases when developing models, exploring the data to determine the major sources of variation is a good first step. PCA will be used for this. As described in the PCA Methods Page, centering and scaling is recommended to homogenize the variance across the genes. ncomp
is set to an arbitrarily high number to understand the captured variance across cotheremponents.
# run pca method on data
pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE)
plot(pca.srbct) # barplot of the eigenvalues (explained variance per component)
FIGURE 1: Barplot of the variance each principal component explains of the SRBCT gene expression data.
Two components would be sufficient to explain a moderate proportion of the data's variance according to Figure 1. Next, the data is projected onto these two components to attempt to observe sources of variation.
plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE, # plot the samples projected
legend = TRUE, title = 'PCA on SRBCT, comp 1  2') # onto the PCA subspace
FIGURE 2: Preliminary (unsupervised) analysis with PCA on the SRBCT gene expression data
It seems that different tumour types do not separate or cluster across the two Principal components of the data, as seen in Figure 2. There are clusters, but these are not explained by the class variable. It can be inferred then that the major source of variation is not attributed to tumour type. Note that Figure 2 has each sample coloured by the class. This is only done for visualisation after the PCA as it is an unsupervised approach.
Initial sPLSDA model
srbct.splsda < splsda(X, Y, ncomp = 10) # set ncomp to 10 for performance assessment later
A PLSDA model is fitted with ten components to evaluate the performance and the number of components necessary for the final model. A sample plot, including confidence ellipses, is shown in Figure 3(a). This plot shows much better clustering of samples according to the tumour type when compared to the PCA output.
The background.predict()
function can also be utilised to depict the separation of class labels as seen in Figure 3(b). This plot provides intuition on how novel samples would be classified according to the model generated by sPLSDA.
# plot the samples projected onto the first two components of the PLSDA subspace
plotIndiv(srbct.splsda , comp = 1:2,
group = srbct$class, ind.names = FALSE, # colour points by class
ellipse = TRUE, # include 95% confidence ellipse for each class
legend = TRUE, title = '(a) PLSDA with confidence ellipses')
# use the max.dist measure to form decision boundaries between classes based on PLSDA data
background = background.predict(srbct.splsda, comp.predicted=2, dist = "max.dist")
# plot the samples projected onto the first two components of the PLSDA subspace
plotIndiv(srbct.splsda, comp = 1:2,
group = srbct$class, ind.names = FALSE, # colour points by class
background = background, # include prediction background for each class
legend = TRUE, title = " (b) PLSDA with prediction background")
FIGURE 3: Sample plots of the SRBCT gene expression data after a basic PLSDA model was operated on this data. (a) depicts the samples with the confidence ellipses of different class labels while (b) depicts the prediction background generated by these samples. Both plots use the first two components as axes.
Tuning sPLSDA
Selecting the number of components
The ncomp Parameter
The number of components to use is a crucial decision and is dictated by the performance of the PLSDA model – i.e. its ability to correctly classify novel samples. The perf()
function is used for this exactly. This is done with repeated crossvalidation. Based on the output of this function, the optimal number of components to use can be identified.
A threefold, 10 repeat crossvalidation procedure is utilised here. Generally, for datasets with numerous samples, at least 10 folds is recommended. 3 or 5 folds is appropriate for smaller datasets and those with minimal samples should use LeaveOneOut (LOO) validation. Consider using 50100 repeats to reduce the impact of the randomly allocated folds during each repeat.
The overall error rate (OER) and balanced error rate (BER) for the three different distance metrics (explained further below) across the first ten components are depicted in Figure 4.
# undergo performance evaluation in order to tune the number of components to use
perf.splsda.srbct < perf(srbct.splsda, validation = "Mfold",
folds = 5, nrepeat = 10, # use repeated crossvalidation
progressBar = FALSE, auc = TRUE) # include AUC values
# plot the outcome of performance evaluation across all ten components
plot(perf.splsda.srbct, col = color.mixo(5:7), sd = TRUE,
legend.position = "horizontal")
FIGURE 4: Tuning the number of components in PLSDA on the SRBCT gene expression data. For each component, repeated crossvalidation (10 × 3−fold CV) is used to evaluate the PLSDA classification performance (OER and BER), for each type of prediction distance; max.dist, centroids.dist and mahalanobis.dist.
From this, it seems three components are appropriate as the error for each distance metric decreases by very incremental amounts after this. Components beyond the third are likely to provide neglible returns to the classification accuracy. A more empirical way to select this number is through the $choice.ncomp
component of the perf()
output object. It runs ttests for a significant different in mean error rate across components. Using the max.dist
metric, this suggests that the optimal number of components is 4. When to use each distance metric is explained further below.
perf.splsda.srbct$choice.ncomp # what is the optimal value of components according to perf()
## max.dist centroids.dist mahalanobis.dist ## overall 4 5 7 ## BER 4 5 7
Selecting the number of variables
The keepX Parameter
In order to determine the number of variables used to construct each latent component, the tune.splsda()
function is utilised. This is performed iteratively, such that components are tuned one at a time. Through this function, the classification error rate can be extracted and averaged across folds and repeats. As mentioned above, an appropriate number of repeats would be around 50100.
Up to four components were selected as per Figure 4. In this case, fivefold cross validation was repeated ten times using the BER of max.dist
as the performance measure – where minimisation was optimal. The cpus
parameter allows for the use of parallelisation of computation as tuning can take a long time on lowtomid range processors.
The output of the tuning is shown in Figure 5. The diamond indicates the optimal number of variables to keep for a given component, selected by which keepX value achieves the lowest classification error rate as determined with a onesided t−test. The error bars indicate the standard deviation across the repeated, crossvalidated folds.
# grid of possible keepX values that will be tested for each component
list.keepX < c(1:10, seq(20, 300, 10))
# undergo the tuning process to determine the optimal number of variables
tune.splsda.srbct < tune.splsda(X, Y, ncomp = 4, # calculate for first 4 components
validation = 'Mfold',
folds = 5, nrepeat = 10, # use repeated crossvalidation
dist = 'max.dist', # use max.dist measure
measure = "BER", # use balanced error rate of dist measure
test.keepX = list.keepX,
cpus = 2) # allow for paralleliation to decrease runtime
plot(tune.splsda.srbct, col = color.jet(4)) # plot output of variable number tuning
FIGURE 5: Tuning keepX for the sPLSDA performed on the SRBCT gene expression data. Each coloured line represents the balanced error rate (yaxis) per component across all tested keepX values (xaxis) with the standard deviation based on the repeated crossvalidation folds. As sPLSDA is an iterative algorithm, values represented for a given component (e.g. comp 1 to 2) include the optimal keepX value chosen for the previous component (comp 1).
The above figure also aids in tuning the number of components. While the tuning of component number (through perf()
) yielded an optimal value of 4, conflicting results can be seen after the use of tune.splsda()
, such that the optimal value is claimed to be 3. After the optmisation of the selected features, the fourth component seemingly minimises the BER negligibly.
tune.splsda.srbct$choice.ncomp$ncomp # what is the optimal value of components according to tune.splsda()
## [1] 3
The exact quantity of features to use for each component can also be extracted from this object:
tune.splsda.srbct$choice.keepX # what are the optimal values of variables according to tune.splsda()
## comp1 comp2 comp3 comp4 ## 9 260 30 10
These values are stored to form the final, optimised model.
optimal.ncomp < tune.splsda.srbct$choice.ncomp$ncomp
optimal.keepX < tune.splsda.srbct$choice.keepX[1:optimal.ncomp]
Final Model
Using all the tuned parameters from above, the final sPLSDA model can be formed.
# form final model with optimised values for component and variable count
final.splsda < splsda(X, Y,
ncomp = optimal.ncomp,
keepX = optimal.keepX)
Plots
Sample Plots
Once again, sample plots will be used to show the distribution of the data in the latent space. The plots seen in Figure 6 contrast Figure 3 starkly. Figure 3 was yielded by an unoptimised, PLSDA model (no feature or component selection). Figure 6 shows the sample plots for the first and second components (a) and the first and third components (b). The difference between Figure 6 (a) and (b) is indicative of the fact that different genes discriminate the samples differently. Genes which contributed to the third component separated the RMS
and NB
classes much better than those which contributed to the second. All three components were well suited to separate the BL
class as it does not overlap any other cluster in either plot.
plotIndiv(final.splsda, comp = c(1,2), # plot samples from final model
group = srbct$class, ind.names = FALSE, # colour by class label
ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse
title = ' (a) sPLSDA on SRBCT, comp 1 & 2')
plotIndiv(final.splsda, comp = c(1,3), # plot samples from final model
group = srbct$class, ind.names = FALSE, # colour by class label
ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse
title = '(b) sPLSDA on SRBCT, comp 1 & 3')
FIGURE 6: Sample plots from sPLSDA performed on the SRBCT gene expression data including 95% confidence ellipses. Samples are projected into the space spanned by the first three components. (a) Components 1 and 2 and (b) Components 1 and 3. Samples are coloured by their tumour subtypes.
A Cluster Image Map (CIM) is shown in Figure 7. It depicts the expression levels of each gene (selected for component construction) for every sample. Euclidean distance with a complete agglomeration method were used to yield this CIM. It can be seen that certain sets of genes had homogeneous expression for different classes. For example, nearly half of the genes had high expression with the EWS
(blue) tumour.
# set the styling of the legend to be homogeneous with previous plots
legend=list(legend = levels(Y), # set of classes
col = unique(color.mixo(Y)), # set of colours
title = "Tumour Type", # legend title
cex = 0.7) # legend size
# generate the CIM, using the legend and colouring rows by each sample's class
cim < cim(final.splsda, row.sideColors = color.mixo(Y),
legend = legend)
FIGURE 7: Clustered Image Map of the genes selected by sPLSDA on the SRBCT gene expression data across all 3 components. A hierarchical clustering based on the gene expression levels of the selected genes, with samples in rows coloured according to their tumour subtype (using Euclidean distance with Complete agglomeration method).
Variable Plots
The stability of a given feature is defined as the proportion of cross validation folds (across repeats) where it was selected for to be used for a given component. Stability values can be extracted via perf.splsda.srbct$features$stable
. These stabilities can be plotted, seen in Figure 8. Those with the highest stability are likely to be much more “important” for a given component. The features used for the first component had consistently lower stability than the other two. This can be explained as there are various combinations of genes that are discriminative on component 1, whereas the number of combinations decreases as component 2 is formed.
# form new perf() object which utilises the final model
perf.splsda.srbct < perf(final.splsda,
folds = 5, nrepeat = 10, # use repeated crossvalidation
validation = "Mfold", dist = "max.dist", # use max.dist measure
progressBar = FALSE)
# plot the stability of each feature for the first three components, 'h' type refers to histogram
par(mfrow=c(1,3))
plot(perf.splsda.srbct$features$stable[[1]], type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = '(a) Comp 1', las =2)
plot(perf.splsda.srbct$features$stable[[2]], type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = '(b) Comp 2', las =2)
plot(perf.splsda.srbct$features$stable[[3]], type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = '(c) Comp 3', las =2)
FIGURE 8: Stability of variable selection from the sPLSDA on the SRBCT gene expression data. The barplot represents the frequency of selection across repeated CV folds for each selected gene for component 1 (a), 2 (b) and 3 ©.
Another variable plot to be used is the correlation circle plot. Figure 9 depicts this. By considering both the correlation circle plot and the sample plot (Figure 6(a)), a group of genes with a positive correlation with component 1 (EH domain
, proteasome
etc.) are observed to be associated with the BL
samples. Two groups of genes are either positively or negatively correlated with component 2. These genes are likely to characterise either the NB
and RMS
classes, or the EWS
class.
var.name.short < substr(srbct$gene.name[, 2], 1, 10) # form simplified gene names
plotVar(final.splsda, comp = c(1,2), var.names = list(var.name.short), cex = 3) # generate correlation circle plot
FIGURE 9: Correlation circle plot representing the genes selected by sPLSDA performed on the SRBCT gene expression data. Gene names are truncated to the first 10 characters. Only the genes selected by sPLSDA are shown in components 1 and 2.
Prediction
When undergoing prediction, the (s)PLSDA data must first be segmented into training and testing, such that there are novel samples to evaluate performance on. Otherwise, it runs the risk of “overfitting”, resulting in inflated predictive ability scores.
train < sample(1:nrow(X), 50) # randomly select 50 samples in training
test < setdiff(1:nrow(X), train) # rest is part of the test set
# store matrices into training and test set:
X.train < X[train, ]
X.test < X[test,]
Y.train < Y[train]
Y.test < Y[test]
A model is then trained on the training data. Note that the previously calculated optimal.keepX
values are used here. In real scenarios, the training model should be tuned itself. It is crucial that when tuning the training model, it is done in the absence of the testing data. This also reduces likelihood of overfitting.
# train the model
train.splsda.srbct < splsda(X.train, Y.train, ncomp = optimal.ncomp, keepX = optimal.keepX)
The model is then applied on the test set using a specific distance metric. In this case, the Mahalanobis distance was used (arbitrarily).
# use the model on the Xtest set
predict.splsda.srbct < predict(train.splsda.srbct, X.test,
dist = "mahalanobis.dist")
To evaluate the predictive performance, confusion matrices can be used. Directly below is such a matrix for a model using just the first two components. Only one misclassification were made – one sample was claimed to belong to the NB
class when it truthfully belonged to the RMS
class.
# evaluate the prediction accuracy for the first two components
predict.comp2 < predict.splsda.srbct$class$mahalanobis.dist[,2]
table(factor(predict.comp2, levels = levels(Y)), Y.test)
## Y.test ## EWS BL NB RMS ## EWS 5 0 0 0 ## BL 0 1 0 0 ## NB 1 0 4 1 ## RMS 0 0 0 1
Below is the equivalent matrix for the model using all three components. It can be seen that the classification accuracy increased.
# evaluate the prediction accuracy for the first three components
predict.comp3 < predict.splsda.srbct$class$mahalanobis.dist[,3]
table(factor(predict.comp3, levels = levels(Y)), Y.test)
## Y.test ## EWS BL NB RMS ## EWS 6 0 0 0 ## BL 0 1 0 0 ## NB 0 0 4 0 ## RMS 0 0 0 2
Performance Plots
AUC plots can be used for performance evaluation. AUC scores are calculated from training crossvalidation sets and averaged in the perf()
function (perf.plsda.srbct$auc
and perf.plsda.srbct$auc.all
for one vs. one class or one vs. all classes respectively).
However, ROC and AUC criteria may not be particularly insightful, or be in agreement with the PLSDA performance, as the prediction threshold in PLSDA is based on specified distance. AUROC curves use a cutoff that maximises specificity and sensitivity rather than this distance and hence should be used a merely a complementary tool.
AUROC plots for models containing one component and three components can be seen in Figure 10 (a) and (b) respectively. Figure 10(a) suggests that the sPLSDA model can distinguish BL
subjects from the other groups with a high true positive and low false positive rate, while the model is less well able to distinguish samples from other classes on component 1. The model including all three components (Figure 10(b)) has a perfect classification accuracy. While the model is definitely improved by the addition of two components, it is the small testing set size which allows for this perfect score.
Note that if print = TRUE
(as is by default), numerical output including AUC and a Wilcoxon test pvalue for each ‘one vs. other’ class comparisons that are performed per component will be printed.
auc.splsda = auroc(final.splsda, roc.comp = 1, print = FALSE) # AUROC for the first component
auc.splsda = auroc(final.splsda, roc.comp = 3, print = FALSE) # AUROC for all three components
FIGURE 10: ROC curve and AUC from sPLSDA on the SRBCT gene expression data on component 1 (a) and all three components (b) averaged across onevs.all comparisons.
More information on Plots
For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages: