Partial Least Squares Discriminant Analysis (PLS-DA) is a linear, multivariate model which uses the PLS algorithm to allow classification of categorically labelled data. PLS-DA seeks for components that best separate the sample groups, whilst the sparse version also selects variables that best discriminate between groups.
Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.
library(mixOmics) # import the mixOmics library
set.seed(5249) # for reproducibility, remove for normal use
The Small Round Blue Cell Tumours (SRBCT) dataset was generating by studying the expression levels of a set of genes to test for markers of certain tumour types (Khan et al, 2001).
The mixOmics
SRBCT dataset is accessed via
srbct
and contains the following:
srbct$gene
(continuous matrix): 63 rows and 2308
columns. The expression levels of the 2308 genes across the 63 tested
subjects.
srbct$class
(categorical vector): class vector of
length 63. Contains the tumour class of each individual. There are four
classes which include Burkitt Lymphoma (BL), Ewing Sarcoma (EWS),
neuroblastoma (NB) and rhabdomyosarcoma (RMS).
srbct$gene.name
(string vectors): two lists of
length 2308. There is the Image.ID
component containing an
integer ID for each gene. The Gene.Description
component
briefly describes the nature of each gene.
To confirm the correct dataframe was extracted, the dimensions are checked. The distribution of class labels is also examined. It can be seen that these class labels are not balanced.
data(srbct) # extract the small round bull cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class data as the Y matrix
dim(X) # check the dimensions of the X dataframe
## [1] 63 2308
summary(Y) # check the distribution of class labels
## EWS BL NB RMS
## 23 8 12 20
As in most cases when developing models, exploring the data to
determine the major sources of variation is a good first step. PCA will
be used for this. As described in the PCA Methods Page, centering and
scaling is recommended to homogenize the variance across the genes.
ncomp
is set to an arbitrarily high number to understand
the captured variance across cotheremponents.
# run pca method on data
pca.srbct = pca(X, ncomp = 10, center = TRUE, scale = TRUE)
plot(pca.srbct) # barplot of the eigenvalues (explained variance per component)
FIGURE 1: Barplot of the variance each principal component explains of the SRBCT gene expression data.
Two components would be sufficient to explain a moderate proportion of the data’s variance according to Figure 1. Next, the data is projected onto these two components to attempt to observe sources of variation.
plotIndiv(pca.srbct, group = srbct$class, ind.names = FALSE, # plot the samples projected
legend = TRUE, title = 'PCA on SRBCT, comp 1 - 2') # onto the PCA subspace
FIGURE 2: Preliminary (unsupervised) analysis with PCA on the SRBCT gene expression data
It seems that different tumour types do not separate or cluster across the two Principal components of the data, as seen in Figure 2. There are clusters, but these are not explained by the class variable. It can be inferred then that the major source of variation is not attributed to tumour type. Note that Figure 2 has each sample coloured by the class. This is only done for visualisation after the PCA as it is an unsupervised approach.
srbct.splsda <- splsda(X, Y, ncomp = 10) # set ncomp to 10 for performance assessment later
A PLS-DA model is fitted with ten components to evaluate the performance and the number of components necessary for the final model. A sample plot, including confidence ellipses, is shown in Figure 3(a). This plot shows much better clustering of samples according to the tumour type when compared to the PCA output.
The background.predict()
function can also be utilised
to depict the separation of class labels as seen in Figure 3(b). This
plot provides intuition on how novel samples would be classified
according to the model generated by sPLS-DA.
# plot the samples projected onto the first two components of the PLS-DA subspace
plotIndiv(srbct.splsda , comp = 1:2,
group = srbct$class, ind.names = FALSE, # colour points by class
ellipse = TRUE, # include 95% confidence ellipse for each class
legend = TRUE, title = '(a) PLSDA with confidence ellipses')
# use the max.dist measure to form decision boundaries between classes based on PLS-DA data
background = background.predict(srbct.splsda, comp.predicted=2, dist = "max.dist")
# plot the samples projected onto the first two components of the PLS-DA subspace
plotIndiv(srbct.splsda, comp = 1:2,
group = srbct$class, ind.names = FALSE, # colour points by class
background = background, # include prediction background for each class
legend = TRUE, title = " (b) PLSDA with prediction background")
FIGURE 3: Sample plots of the SRBCT gene expression data after a basic PLS-DA model was operated on this data. (a) depicts the samples with the confidence ellipses of different class labels while (b) depicts the prediction background generated by these samples. Both plots use the first two components as axes.
The number of components to use is a crucial decision and is dictated
by the performance of the PLS-DA model - i.e. its ability to correctly
classify novel samples. The perf()
function is used for
this exactly. This is done with repeated cross-validation. Based on the
output of this function, the optimal number of components to use can be
identified.
A three-fold, 10 repeat cross-validation procedure is utilised here. Generally, for datasets with numerous samples, at least 10 folds is recommended. 3 or 5 folds is appropriate for smaller datasets and those with minimal samples should use Leave-One-Out (LOO) validation. Consider using 50-100 repeats to reduce the impact of the randomly allocated folds during each repeat.
The overall error rate (OER) and balanced error rate (BER) for the three different distance metrics (explained further below) across the first ten components are depicted in Figure 4.
# undergo performance evaluation in order to tune the number of components to use
perf.splsda.srbct <- perf(srbct.splsda, validation = "Mfold",
folds = 5, nrepeat = 10, # use repeated cross-validation
progressBar = FALSE, auc = TRUE) # include AUC values
# plot the outcome of performance evaluation across all ten components
plot(perf.splsda.srbct, col = color.mixo(5:7), sd = TRUE,
legend.position = "horizontal")
FIGURE 4: Tuning the number of components in PLS-DA on the SRBCT gene
expression data. For each component, repeated cross-validation (10 ×
3−fold CV) is used to evaluate the PLS-DA classification performance
(OER and BER), for each type of prediction distance;
max.dist
, centroids.dist
and
mahalanobis.dist
.
From this, it seems three components are appropriate as the error for
each distance metric decreases by very incremental amounts after this.
Components beyond the third are likely to provide neglible returns to
the classification accuracy. A more empirical way to select this number
is through the $choice.ncomp
component of the
perf()
output object. It runs t-tests for a significant
different in mean error rate across components. Using the
max.dist
metric, this suggests that the optimal number of
components is 4. When to use each distance metric is explained further
below.
perf.splsda.srbct$choice.ncomp # what is the optimal value of components according to perf()
## max.dist centroids.dist mahalanobis.dist
## overall 4 4 7
## BER 4 4 7
In order to determine the number of variables used to construct each
latent component, the tune.splsda()
function is utilised.
This is performed iteratively, such that components are tuned one at a
time. Through this function, the classification error rate can be
extracted and averaged across folds and repeats. As mentioned above, an
appropriate number of repeats would be around 50-100.
Up to four components were selected as per Figure 4. In this case,
five-fold cross validation was repeated ten times using the BER of
max.dist
as the performance measure - where minimisation
was optimal. The cpus
parameter allows for the use of
parallelisation of computation as tuning can take a long time on
low-to-mid range processors.
The output of the tuning is shown in Figure 5. The diamond indicates the optimal number of variables to keep for a given component, selected by which keepX value achieves the lowest classification error rate as determined with a one-sided t−test. The error bars indicate the standard deviation across the repeated, cross-validated folds.
# grid of possible keepX values that will be tested for each component
list.keepX <- c(1:10, seq(20, 300, 10))
# undergo the tuning process to determine the optimal number of variables
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 4, # calculate for first 4 components
validation = 'Mfold',
folds = 5, nrepeat = 10, # use repeated cross-validation
dist = 'max.dist', # use max.dist measure
measure = "BER", # use balanced error rate of dist measure
test.keepX = list.keepX)
plot(tune.splsda.srbct, col = color.jet(4)) # plot output of variable number tuning
FIGURE 5: Tuning keepX for the sPLS-DA performed on the SRBCT gene expression data. Each coloured line represents the balanced error rate (y-axis) per component across all tested keepX values (x-axis) with the standard deviation based on the repeated cross-validation folds. As sPLS-DA is an iterative algorithm, values represented for a given component (e.g. comp 1 to 2) include the optimal keepX value chosen for the previous component (comp 1).
The above figure also aids in tuning the number of components. While
the tuning of component number (through perf()
) yielded an
optimal value of 4, conflicting results can be seen after the use of
tune.splsda()
, such that the optimal value is claimed to be
3. After the optmisation of the selected features, the fourth component
seemingly minimises the BER negligibly.
tune.splsda.srbct$choice.ncomp$ncomp # what is the optimal value of components according to tune.splsda()
## [1] 3
The exact quantity of features to use for each component can also be extracted from this object:
tune.splsda.srbct$choice.keepX # what are the optimal values of variables according to tune.splsda()
## comp1 comp2 comp3 comp4
## 10 290 240 80
These values are stored to form the final, optimised model.
optimal.ncomp <- tune.splsda.srbct$choice.ncomp$ncomp
optimal.keepX <- tune.splsda.srbct$choice.keepX[1:optimal.ncomp]
Using all the tuned parameters from above, the final sPLS-DA model can be formed.
# form final model with optimised values for component and variable count
final.splsda <- splsda(X, Y,
ncomp = optimal.ncomp,
keepX = optimal.keepX)
Once again, sample plots will be used to show the distribution of the
data in the latent space. The plots seen in Figure 6 contrast Figure 3
starkly. Figure 3 was yielded by an unoptimised, PLS-DA model (no
feature or component selection). Figure 6 shows the sample plots for the
first and second components (a) and the first and third components (b).
The difference between Figure 6 (a) and (b) is indicative of the fact
that different genes discriminate the samples differently. Genes which
contributed to the third component separated the RMS
and
NB
classes much better than those which contributed to the
second. All three components were well suited to separate the
BL
class as it does not overlap any other cluster in either
plot.
plotIndiv(final.splsda, comp = c(1,2), # plot samples from final model
group = srbct$class, ind.names = FALSE, # colour by class label
ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse
title = ' (a) sPLS-DA on SRBCT, comp 1 & 2')
plotIndiv(final.splsda, comp = c(1,3), # plot samples from final model
group = srbct$class, ind.names = FALSE, # colour by class label
ellipse = TRUE, legend = TRUE, # include 95% confidence ellipse
title = '(b) sPLS-DA on SRBCT, comp 1 & 3')
FIGURE 6: Sample plots from sPLS-DA performed on the SRBCT gene expression data including 95% confidence ellipses. Samples are projected into the space spanned by the first three components. (a) Components 1 and 2 and (b) Components 1 and 3. Samples are coloured by their tumour subtypes.
A Cluster Image Map (CIM) is shown in Figure 7. It depicts the
expression levels of each gene (selected for component construction) for
every sample. Euclidean distance with a complete agglomeration method
were used to yield this CIM. It can be seen that certain sets of genes
had homogeneous expression for different classes. For example, nearly
half of the genes had high expression with the EWS
(blue)
tumour.
# set the styling of the legend to be homogeneous with previous plots
legend=list(legend = levels(Y), # set of classes
col = unique(color.mixo(Y)), # set of colours
title = "Tumour Type", # legend title
cex = 0.7) # legend size
# generate the CIM, using the legend and colouring rows by each sample's class
cim <- cim(final.splsda, row.sideColors = color.mixo(Y),
legend = legend)
FIGURE 7: Clustered Image Map of the genes selected by sPLS-DA on the SRBCT gene expression data across all 3 components. A hierarchical clustering based on the gene expression levels of the selected genes, with samples in rows coloured according to their tumour subtype (using Euclidean distance with Complete agglomeration method).
The stability of a given feature is defined as the proportion of
cross validation folds (across repeats) where it was selected for to be
used for a given component. Stability values can be extracted via
perf.splsda.srbct$features$stable
. These stabilities can be
plotted, seen in Figure 8. Those with the highest stability are likely
to be much more “important” for a given component. The features used for
the first component had consistently lower stability than the other two.
This can be explained as there are various combinations of genes that
are discriminative on component 1, whereas the number of combinations
decreases as component 2 is formed.
# form new perf() object which utilises the final model
perf.splsda.srbct <- perf(final.splsda,
folds = 5, nrepeat = 10, # use repeated cross-validation
validation = "Mfold", dist = "max.dist", # use max.dist measure
progressBar = FALSE)
# plot the stability of each feature for the first three components, 'h' type refers to histogram
par(mfrow=c(1,3))
plot(perf.splsda.srbct$features$stable[[1]], type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = '(a) Comp 1', las =2)
plot(perf.splsda.srbct$features$stable[[2]], type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = '(b) Comp 2', las =2)
plot(perf.splsda.srbct$features$stable[[3]], type = 'h',
ylab = 'Stability',
xlab = 'Features',
main = '(c) Comp 3', las =2)
FIGURE 8: Stability of variable selection from the sPLS-DA on the SRBCT gene expression data. The barplot represents the frequency of selection across repeated CV folds for each selected gene for component 1 (a), 2 (b) and 3 (c).
Another variable plot to be used is the correlation circle plot.
Figure 9 depicts this. By considering both the correlation circle plot
and the sample plot (Figure 6(a)), a group of genes with a positive
correlation with component 1 (EH domain
,
proteasome
etc.) are observed to be associated with the
BL
samples. Two groups of genes are either positively or
negatively correlated with component 2. These genes are likely to
characterise either the NB
and RMS
classes, or
the EWS
class.
var.name.short <- substr(srbct$gene.name[, 2], 1, 10) # form simplified gene names
plotVar(final.splsda, comp = c(1,2), var.names = list(var.name.short), cex = 3) # generate correlation circle plot
FIGURE 9: Correlation circle plot representing the genes selected by sPLS-DA performed on the SRBCT gene expression data. Gene names are truncated to the first 10 characters. Only the genes selected by sPLS-DA are shown in components 1 and 2.
When undergoing prediction, the (s)PLS-DA data must first be segmented into training and testing, such that there are novel samples to evaluate performance on. Otherwise, it runs the risk of “overfitting”, resulting in inflated predictive ability scores.
train <- sample(1:nrow(X), 50) # randomly select 50 samples in training
test <- setdiff(1:nrow(X), train) # rest is part of the test set
# store matrices into training and test set:
X.train <- X[train, ]
X.test <- X[test,]
Y.train <- Y[train]
Y.test <- Y[test]
A model is then trained on the training data. Note that the
previously calculated optimal.keepX
values are used here.
In real scenarios, the training model should be tuned itself. It is
crucial that when tuning the training model, it is done in the absence
of the testing data. This also reduces likelihood of overfitting.
# train the model
train.splsda.srbct <- splsda(X.train, Y.train, ncomp = optimal.ncomp, keepX = optimal.keepX)
The model is then applied on the test set using a specific distance metric. In this case, the Mahalanobis distance was used (arbitrarily).
# use the model on the Xtest set
predict.splsda.srbct <- predict(train.splsda.srbct, X.test,
dist = "mahalanobis.dist")
To evaluate the predictive performance, confusion matrices can be
used. Directly below is such a matrix for a model using just the first
two components. Only one misclassification were made - one sample was
claimed to belong to the NB
class when it truthfully
belonged to the RMS
class.
# evaluate the prediction accuracy for the first two components
predict.comp2 <- predict.splsda.srbct$class$mahalanobis.dist[,2]
table(factor(predict.comp2, levels = levels(Y)), Y.test)
## Y.test
## EWS BL NB RMS
## EWS 3 0 0 0
## BL 0 0 0 0
## NB 2 0 3 1
## RMS 0 0 0 4
Below is the equivalent matrix for the model using all three components. It can be seen that the classification accuracy increased.
# evaluate the prediction accuracy for the first three components
predict.comp3 <- predict.splsda.srbct$class$mahalanobis.dist[,3]
table(factor(predict.comp3, levels = levels(Y)), Y.test)
## Y.test
## EWS BL NB RMS
## EWS 3 0 0 0
## BL 0 0 0 0
## NB 1 0 3 1
## RMS 1 0 0 4
AUC plots can be used for performance evaluation. AUC scores are
calculated from training cross-validation sets and averaged in the
perf()
function (perf.plsda.srbct$auc
and
perf.plsda.srbct$auc.all
for one vs. one class or one
vs. all classes respectively).
However, ROC and AUC criteria may not be particularly insightful, or be in agreement with the PLSDA performance, as the prediction threshold in PLS-DA is based on specified distance. AUROC curves use a cutoff that maximises specificity and sensitivity rather than this distance and hence should be used a merely a complementary tool.
AUROC plots for models containing one component and three components
can be seen in Figure 10 (a) and (b) respectively. Figure 10(a) suggests
that the sPLS-DA model can distinguish BL
subjects from the
other groups with a high true positive and low false positive rate,
while the model is less well able to distinguish samples from other
classes on component 1. The model including all three components (Figure
10(b)) has a perfect classification accuracy. While the model is
definitely improved by the addition of two components, it is the small
testing set size which allows for this perfect score.
Note that if print = TRUE
(as is by default), numerical
output including AUC and a Wilcoxon test p-value for each ‘one
vs. other’ class comparisons that are performed per component will be
printed.
auc.splsda = auroc(final.splsda, roc.comp = 1, print = FALSE) # AUROC for the first component
auc.splsda = auroc(final.splsda, roc.comp = 3, print = FALSE) # AUROC for all three components
FIGURE 10: ROC curve and AUC from sPLS-DA on the SRBCT gene expression data on component 1 (a) and all three components (b) averaged across one-vs.-all comparisons.