PLS-DA

PLS Discriminant Analysis (PLS-DA)

Partial Least Squares was not originally designed for classification and discrimination problems, but it has often been used for that purpose (Nguyen and Rocke, 2002; Tan et al., 2004). The qualitative response matrix Y is internally recoded as a dummy block matrix that records the membership of each observation, i.e. each response category is coded via an indicator variable. The PLS regression (now PLS-DA) is then run as if Y were a continuous matrix, and works well in practice for large data sets where Linear Discriminant Analysis faces collinearity issues.
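The recoding step can be illustrated with the unmap() helper exported by mixOmics (a minimal sketch on a toy factor; plsda() performs this recoding internally, so it is never required in practice):

library(mixOmics)
# toy outcome factor with three classes
Y.toy <- factor(c("low", "low", "high", "mid", "high"))
# unmap() recodes the factor as a dummy indicator matrix:
# one column per class, with a 1 marking the class membership of each sample
unmap(Y.toy)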

  • PLS-Discriminant Analysis (PLS-DA, Barker and Rayens, 2003) is a linear classification model that is able to predict the class of new samples.

  • sparse PLS-DA (sPLS-DA) enables the selection of the most predictive or discriminative features in the data that help classify the samples (Lê Cao et al., 2011).

sparse PLS Discriminant Analysis (sPLS-DA)

sPLS-DA performs variable selection and classification in a one-step procedure. It is a special case of sparse PLS that enables variable selection. Here, variables are selected only in the X data set and in a supervised framework, i.e. we select the X-variables with respect to the different classes of the samples.

Usage in mixOmics

The non-sparse and sparse PLS-DA methods are implemented in mixOmics via the functions plsda and splsda, as displayed below. When possible, we strongly advise working with a training and a testing set to assess the performance of the model, or using cross-validation.

PLS-DA

As with PLS in regression mode, the tuning parameters include the number of dimensions or components, ncomp. We can rely on the estimation of the classification error rate using cross-validation to choose it.

library(mixOmics)
data(liver.toxicity)
X <- as.matrix(liver.toxicity$gene)            # gene expression data
Y <- as.factor(liver.toxicity$treatment[, 4])  # outcome: 4th column of the treatment information, as a factor

## PLS-DA function
plsda.res <- plsda(X, Y, ncomp = 5) # where ncomp is the number of components wanted

We use the function perf to evaluate a PLS-DA model, using 5-fold cross-validation repeated 10 times (for an accurate estimation, consider 50-100 repeats when using fold-CV; no repeat is necessary with loo-CV). In addition, the function helps decide on the optimal number of components (for either PLS-DA or sPLS-DA, see the case study). The Balanced Error Rate (BER) and overall error rate are displayed for all prediction distances (see ?predict and details in the suppl. info of Rohart et al. (2017)):

# this code takes ~ 1 min to run
set.seed(2543) # for reproducibility here, only when the 'cpus' argument is not used
perf.plsda <- perf(plsda.res, validation = "Mfold", folds = 5, 
                  progressBar = FALSE, auc = TRUE, nrepeat = 10) 
# perf.plsda$error.rate  # error rates
plot(perf.plsda, col = color.mixo(1:3), sd = TRUE, legend.position = "horizontal")

[Figure: classification error rates (overall and BER) per component and prediction distance, output of plot(perf.plsda)]

Here ncomp = 4 with max distance seems to achieve the best classification performance and can be retained for the final model, or for the sparse PLS-DA model.
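Assuming a recent version of mixOmics, the perf object should also store a suggested number of components per error measure and prediction distance (based on the repeated cross-validation runs), which can be checked against the plot; a minimal sketch using the perf.plsda object created above:

# suggested ncomp for each error measure (overall, BER) and prediction distance
perf.plsda$choice.ncomp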

The AUROC can also be plotted; note that it only complements the PLS-DA performance results (see Rohart et al. (2017) for more details).
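A sketch using the auroc() function, where roc.comp selects the component for which the ROC curves and AUC values are computed:

# ROC curves and AUC per class (one-vs-all) for the model with 4 components
auc.plsda <- auroc(plsda.res, roc.comp = 4)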

sPLS-DA for variable selection

Tuning sPLS-DA

We use the function tune.splsda to select the parameters, including the number of components ncomp and the number of variables to select in the X data set, keepX.

# grid of possible keepX values that will be tested for each comp 
list.keepX <- seq(10, 50, 10)
set.seed(2543) # for reproducibility here
# to speed up the computation, consider the 'cpus' argument
# this code takes ~ 4 min to run
tune.splsda <- tune.splsda(X, Y, ncomp = 4, validation = 'Mfold', folds = 5, 
                           progressBar = FALSE, dist = 'max.dist',
                           test.keepX = list.keepX, nrepeat = 10) #nrepeat 50-100 for better estimate
# tune.splsda  # the various outputs

The optimal number of features to select (per component):

tune.splsda$choice.keepX
## comp1 comp2 comp3 comp4 
##    40    10    20    10

The optimal number of components:

tune.splsda$choice.ncomp$ncomp
## [1] 3

We include these parameters in our final sPLS-DA model:

choice.ncomp <- tune.splsda$choice.ncomp$ncomp
choice.keepX <- tune.splsda$choice.keepX[1:choice.ncomp]
## sPLS-DA function
splsda.res <- splsda(X, Y, ncomp = choice.ncomp, keepX = choice.keepX) # where keepX is the number of variables selected on each component

The performance of our final sPLS-DA model is:

perf.splsda <- perf(splsda.res, validation = "Mfold", folds = 5, 
                  progressBar = FALSE, auc = TRUE, nrepeat = 10) 

perf.splsda$error.rate
## $overall
##         max.dist centroids.dist mahalanobis.dist
## comp 1 0.5265625      0.5578125        0.5578125
## comp 2 0.1718750      0.2265625        0.2234375
## comp 3 0.0484375      0.0765625        0.0890625
## 
## $BER
##         max.dist centroids.dist mahalanobis.dist
## comp 1 0.5265625      0.5578125        0.5578125
## comp 2 0.1718750      0.2265625        0.2234375
## comp 3 0.0484375      0.0765625        0.0890625
# the breakdown of the error rate per class is also insightful for the
# predictions of the model:
# perf.splsda$error.rate.class

As above, the AUROC can also be plotted for the sPLS-DA model; it only complements the performance results (see Rohart et al. (2017) for more details).

The final selection of features, along with their loading weight coefficients (the most important variables have the largest absolute values), can be output using the selectVar function:

selectVar(splsda.res, comp = 1)$value
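The loading weights of the selected variables can also be visualised; a sketch using the plotLoadings() function, where contrib = 'max' colours each selected variable according to the class in which it has the maximum mean value:

# barplot of the loading weights of the variables selected on component 1
plotLoadings(splsda.res, comp = 1, contrib = 'max', method = 'mean')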

Prediction

With PLS-DA and sPLS-DA, the classes of new samples or observations can be predicted with the predict function (methods predict.plsda and predict.splsda). Here we illustrate one example where we 'pretend' to have an external data set to use as a test set. The prediction distance can be specified (see the suppl. info in Rohart et al. (2017)).

set.seed(2543) # for reproducibility here
# randomly assign each sample to one of three groups
samp <- sample(1:3, nrow(X), replace = TRUE) 

# 1/3 of the data will compose the test set
test <- which(samp == 1) 
# rest will compose the training set
train <- setdiff(1:nrow(X), test) 

Now that the training and test set are set up, we run a PLS-DA model and look at the prediction for the test samples:

## For PLS-DA, train the model
plsda.train <- plsda(X[train, ], Y[train], ncomp = 4)
# then predict
test.predict <- predict(plsda.train, X[test, ], dist = "max.dist")
# store prediction for the 4th component
prediction <- test.predict$class$max.dist[,4] 
# calculate the error rate of the model
confusion.mat <- get.confusion_matrix(truth = Y[test], predicted = prediction)
get.BER(confusion.mat)
## [1] 0.15

Similarly, we run a sparse PLS-DA model. Note that the tuning should be performed on the training set only, to avoid overfitting! Here we set keepX arbitrarily:

splsda.train <- splsda(X[train, ], Y[train], ncomp = 4, keepX = c(10,20,10))
test.predict <- predict(splsda.train, X[test, ], dist = "max.dist")
# store prediction for the 4th component
prediction <- test.predict$class$max.dist[,4] 
# calculate the error rate of the model
confusion.mat <- get.confusion_matrix(truth = Y[test], predicted = prediction)
get.BER(confusion.mat)
## [1] 0

Case study

See Case Study: sPLS-DA srbct for more details and plotting options.

Reference

In addition to the references listed on the (s)PLS page:

Pérez-Enciso, M. and Tenenhaus, M., 2003. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human genetics, 112(5-6), pp.581-592.

Nguyen, D.V. and Rocke, D.M., 2002. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), pp.39-50.

Lê Cao, K.A., Boitard, S. and Besse, P., 2011. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12(1), p.253.

Rohart, F., Gautier, B., Singh, A. and Lê Cao, K.A., 2017. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Computational Biology, 13(11), e1005752.