PLS Discriminant Analysis (PLS-DA)
Partial Least Squares was not originally designed for classification and discrimination problems, but it has often been used for that purpose (Nguyen and Rocke, 2002; Tan et al., 2004). The response matrix Y is qualitative and is internally recoded as a dummy block matrix that records the membership of each observation, i.e. each response category is coded via an indicator variable. The PLS regression (now PLS-DA) is then run as if Y were a continuous matrix, and it works well in practice for large data sets where Linear Discriminant Analysis faces collinearity issues.
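As a minimal sketch of this recoding in base R (the factor and the helper below are illustrative, not mixOmics' internal code), a qualitative response is turned into one indicator column per class:

```r
# hypothetical three-class qualitative response
Y <- factor(c("ctrl", "trt1", "trt2", "ctrl"))
# one indicator (dummy) column per class; the PLS regression is then
# run on this matrix as if it were continuous
Y.dummy <- sapply(levels(Y), function(lev) as.integer(Y == lev))
Y.dummy
##      ctrl trt1 trt2
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
## [4,]    1    0    0
```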

PLS Discriminant Analysis (PLS-DA; Barker and Rayens, 2003) is a linear classification model that is able to predict the class of new samples.

sparse PLS Discriminant Analysis (sPLS-DA)
sPLS-DA enables the selection of the most predictive or discriminative features in the data that help classify the samples (Lê Cao et al., 2011). It performs variable selection and classification in a one-step procedure: sPLS-DA is a special case of sparse PLS that allows variable selection. Here, variables are selected only in the X data set, and in a supervised framework, i.e. we select the X variables with respect to the different classes of the samples.
Usage in mixOmics
The non-sparse and sparse PLS-DA are implemented in mixOmics via the functions plsda and splsda, as displayed below. When possible, we strongly advise working with a training and a testing set to assess the performance of the model, or using cross-validation.
PLSDA
Similar to PLS in regression mode, the tuning parameters include the number of dimensions or components, ncomp. We can rely on the estimation of the classification error rate using cross-validation.
library(mixOmics)
data(liver.toxicity)
X <- as.matrix(liver.toxicity$gene)
Y <- as.factor(liver.toxicity$treatment[, 4])
## PLS-DA function
plsda.res <- plsda(X, Y, ncomp = 5) # where ncomp is the number of components wanted
We use the function perf to evaluate a PLS-DA model, using 5-fold cross-validation repeated 10 times (for an accurate estimation, consider 50-100 repeats when using M-fold CV; no repeat is necessary with leave-one-out CV). In addition, the function is useful to decide on the optimal number of components to choose (either for PLS-DA or sPLS-DA; see the case study). The Balanced Error Rate (BER) and the overall error rate are displayed for all prediction distances (see ?predict and details in the supplementary information of Rohart et al. (2017)):
# this code takes ~ 1 min to run
set.seed(2543) # for reproducibility here, only when the 'cpus' argument is not used
perf.plsda <- perf(plsda.res, validation = "Mfold", folds = 5,
                   progressBar = FALSE, auc = TRUE, nrepeat = 10)
# perf.plsda$error.rate # error rates
plot(perf.plsda, col = color.mixo(1:3), sd = TRUE, legend.position = "horizontal")
Here ncomp = 4 with the max.dist distance seems to achieve the best classification performance and can be retained for the final model, or for the sparse PLS-DA model.
The AUROC can also be plotted; note that it only complements the PLS-DA performance results (see Rohart et al. (2017) for more details).
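As a reminder of what the AUROC summarises, the AUC is the probability that a randomly chosen sample of a class is scored higher than a randomly chosen sample outside it. A base-R sketch on hypothetical scores (not mixOmics' auroc implementation):

```r
# hypothetical predicted scores and true class memberships (1 = in class)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   0,   1,   0,   0)
pos <- scores[labels == 1]
neg <- scores[labels == 0]
# AUC = proportion of (positive, negative) pairs ranked correctly,
# counting ties as half
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
auc
## [1] 0.8125
```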
sPLS-DA for variable selection
Tuning sPLS-DA
We use the function tune.splsda to select the parameters including the number of components, ncomp, and the number of variables to choose in the X data set, keepX.
# grid of possible keepX values that will be tested for each comp
list.keepX <- c(seq(10, 50, 10))
set.seed(2543) # for reproducibility here
# to speed up the computational time, consider the cpus argument
# takes ~ 4 min to run
tune.splsda <- tune.splsda(X, Y, ncomp = 4, validation = 'Mfold', folds = 5,
                           progressBar = FALSE, dist = 'max.dist',
                           test.keepX = list.keepX, nrepeat = 10) # nrepeat 50-100 for a better estimate
# tune.splsda # the various outputs
The optimal number of features to select (per component):
tune.splsda$choice.keepX
## comp1 comp2 comp3 comp4 
##    40    10    20    10
The optimal number of components:
tune.splsda$choice.ncomp$ncomp
## [1] 3
We include these parameters in our final sPLSDA model:
choice.ncomp <- tune.splsda$choice.ncomp$ncomp
choice.keepX <- tune.splsda$choice.keepX[1:choice.ncomp]
## sPLS-DA function
splsda.res <- splsda(X, Y, ncomp = choice.ncomp, keepX = choice.keepX) # where keepX is the number of variables selected for each component
The performance of our final sPLSDA model is:
perf.splsda <- perf(splsda.res, validation = "Mfold", folds = 5,
                    progressBar = FALSE, auc = TRUE, nrepeat = 10)
perf.splsda$error.rate
## $overall
##         max.dist centroids.dist mahalanobis.dist
## comp 1 0.5265625      0.5578125        0.5578125
## comp 2 0.1718750      0.2265625        0.2234375
## comp 3 0.0484375      0.0765625        0.0890625
## 
## $BER
##         max.dist centroids.dist mahalanobis.dist
## comp 1 0.5265625      0.5578125        0.5578125
## comp 2 0.1718750      0.2265625        0.2234375
## comp 3 0.0484375      0.0765625        0.0890625
# the breakdown of the error rate per class is also insightful
# for the prediction of the model:
# perf.splsda$error.rate.class
As above, the AUROC can be plotted for the sPLS-DA model; it only complements the performance results (see Rohart et al. (2017) for more details).
The final selection of features can be output, along with their weight coefficients (the most important being ranked by absolute value), using the selectVar function:
selectVar(splsda.res, comp = 1)$value
Prediction
With PLS-DA and sPLS-DA, the classes of new samples or observations can be predicted with the predict.plsda and predict.splsda functions. Here we illustrate one example where we 'pretend' an external data set is available as a test set. The prediction distance can be specified (see the supplementary information in Rohart et al. (2017)).
set.seed(2543) # for reproducibility here
# creation of a randomised set of samples
samp <- sample(1:3, nrow(X), replace = TRUE)
# 1/3 of the data will compose the test set
test <- which(samp == 1)
# the rest will compose the training set
train <- setdiff(1:nrow(X), test)
Now that the training and test sets are set up, we run a PLS-DA model and look at the predictions for the test samples:
## For PLS-DA, train the model
plsda.train <- plsda(X[train, ], Y[train], ncomp = 4)
# then predict
test.predict <- predict(plsda.train, X[test, ], dist = "max.dist")
# store the predictions for the 4th component
prediction <- test.predict$class$max.dist[, 4]
# calculate the error rate of the model
confusion.mat <- get.confusion_matrix(truth = Y[test], predicted = prediction)
get.BER(confusion.mat)
## [1] 0.15
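The Balanced Error Rate averages the per-class error rates, so an abundant class cannot dominate the score. A base-R sketch on a hypothetical confusion matrix (not mixOmics' get.BER code):

```r
# hypothetical confusion matrix: rows = true classes, columns = predictions
conf <- matrix(c(8, 2,
                 1, 9), nrow = 2, byrow = TRUE,
               dimnames = list(truth = c("A", "B"), predicted = c("A", "B")))
# per-class error rate = misclassified samples / class size
per.class.error <- 1 - diag(conf) / rowSums(conf)
# BER is the mean of the per-class error rates
BER <- mean(per.class.error)
BER
## [1] 0.15
```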
Similarly, we run a sparse PLS-DA model. The tuning step should be performed on the training set only, to avoid overfitting. Here we set keepX arbitrarily:
splsda.train <- splsda(X[train, ], Y[train], ncomp = 4, keepX = c(10, 20, 10))
test.predict <- predict(splsda.train, X[test, ], dist = "max.dist")
# store the predictions for the 4th component
prediction <- test.predict$class$max.dist[, 4]
# calculate the error rate of the model
confusion.mat <- get.confusion_matrix(truth = Y[test], predicted = prediction)
get.BER(confusion.mat)
## [1] 0
Case study
See Case Study: sPLSDA srbct for more details and plotting options.
References
In addition to the references from (s)PLS.