PLS Discriminat Anlysis (PLSDA)
Although Partial Least Squares was not originally designed for classification and discrimination prob lems, it has often been used for that purpose (Nguyen and Rocke, 2002; Tan et al., 2004). The response matrix Y is qualitative and is internally recoded as a dummy block matrix that records the membership of each observation, i.e. each of the response categories are coded via an indicator variable. The PLS regression (now PLSDA) is then run as if Y was a continuous matrix. Note that this might be wrong from a theoretical point of view, however, it has been previously shown that this works well in practice.

PLSDiscriminant Analysis (PLSDA) is a linear model which performs a classification task and is able to predict the class of new samples (Barker and Rayens, 2003).

The sparse variant sparse PLSDA (sPLSDA) enables the selection of the most predictive or discriminative features in the data that help classify the samples (Lê Cao et al., 2011).
sparse PLS Discriminat Analysis (sPLSDA)
sPLSDA performs variable selection and classification in a one step procedure. sPLSDA is a special case of sparse PLS to allow variable selection, except that this time, the variables are only selected in the X data set and in a supervised framework, i.e. we are selecting the Xvariables with respect to different categories of the samples.
Usage in mixOmics
(s)PLSDA is implemented in mixOmics via the functions plsda() and splsda() as displayed below. For both plsda and splsda, we strongly advise to work with a training and a testing set.
PLSDA
Variable selection
Similar to a PLSregression mode, the tuning parameters include the number of dimensions or components ncomp. We can rely on the estimation of the classification error rate using leaveoneout or crossvalidation. We use the function perf() to evaluate a PLSDA model and predict().
library(mixOmics)
data(liver.toxicity)
X < as.matrix(liver.toxicity$gene)
Y < as.factor(liver.toxicity$treatment[, 4])
## PLSDA function
result < plsda(X, Y, ncomp = 3) # where ncomp is the number of components wanted
sPLSDA
Tuning sPLSDA
We use the function tune.splsda() to select the parameters including the number of components/dimensions, ncomp, and the number of variables to choose in the X data set, keepX.
# grid of possible keepX values that will be tested for comp 1 and comp 2
list.keepX < c(1:10, seq(20, 100, 10))
# to speed up computation in this example we choose 5 folds repeated 10 times:
tune.splsda.srbct < tune.splsda(X, Y, ncomp = 5, validation = 'Mfold', folds = 5,
progressBar = FALSE, dist = 'mahalanobis.dist',
test.keepX = list.keepX, nrepeat = 10) #nrepeat 50100
## sPLSDA function
result < splsda(X, Y, ncomp = 3, keepX = c(10, 10, 10)) # where keepX is the number of variables selected for each components
With (s)PLSDA, the classes of new samples or observations can be predicted in the model by using thepredict function. This is an example to perform 3fold crossvalidation. Normally 10fold crossvalidation should be performed several times and the results should be averaged to get a better estimation of the generalization performance:
data(liver.toxicity)
X < as.matrix(liver.toxicity$gene)
Y < as.factor(liver.toxicity$treatment[, 4]) # Y is a factor, we chose it as the time points of necropsy
i < 1
samp < sample(1:3, nrow(X), replace = TRUE) # Creation of a list of the same size as X containing 1, 2 or 3
test < which(samp == i) # Search which column in samp has a value of 1
train < setdiff(1:nrow(X), test) # Keeping the column that are not in test
## For PLSDA
plsda.train < plsda(X[train, ], Y[train], ncomp = 3)
test.predict < predict(plsda.train, X[test, ], method = "max.dist")
## For sPLSDA
splsda.train < splsda(X[train, ], Y[train], ncomp = 3, keepX = c(10, 10, 10))
test.predict < predict(splsda.train, X[test, ], method = "max.dist")
See Case Study: sPLSDA srbct for more details and plotting options.
Reference
In addintion to references form (s)PLS.