library(mixOmics) # import the mixOmics library
data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class labels as the Y factor

result.plsda.srbct <- plsda(X, Y) # run the method
plotIndiv(result.plsda.srbct) # plot the samples
plotVar(result.plsda.srbct) # plot the variables
?plsda can be run to determine the default arguments of this function:
- Number of Components (ncomp = 2): The first two PLS-DA components are calculated,
- Scaling of data (scale = TRUE): Each data set is scaled (each variable has a variance of 1 to enable easier comparison); data is internally centered.
splsda.result <- splsda(X, Y, keepX = c(50,30)) # run the method
plotIndiv(splsda.result) # plot the samples
plotVar(splsda.result) # plot the variables
# extract the variables used to construct the first latent component
selectVar(splsda.result, comp = 1)$name
# depict weight assigned to each of these variables
plotLoadings(splsda.result, method = 'mean', contrib = 'max')
?splsda can be run to determine the default arguments of this function:
- If keepX is not supplied, this function will be equivalent to the plsda() function as all variables will be used.
- Uses the same defaults for ncomp and scale as plsda().
PLS Discriminant Analysis
PLS was designed with a canonical (exploratory) approach and a regression (explanatory) approach in mind. Partial Least Squares Discriminant Analysis (PLS-DA) was developed so that the PLS algorithm could also be used for classification [1, 2]. It works much like PLS, except that the response vector y holds categorical class labels rather than continuous values. PLS-DA shares the advantages of PLS: it operates efficiently on large data sets and is not negatively affected by collinearity.
Sparse PLS Discriminant Analysis
The sparse variant (sPLS-DA) selects the most predictive or discriminative features in the data to classify the samples [3]. sPLS-DA performs variable selection and classification in a one-step procedure. It is a special case of sparse PLS in which the lasso penalisation is applied only to the loading vector associated with the X data set.
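To illustrate what a lasso penalty on a loading vector does, here is a minimal base R sketch. It assumes the penalty is applied by soft-thresholding, a common way to implement lasso-type shrinkage. The function `soft.threshold`, the loading vector `a` and the value `keep.x` are hypothetical and chosen for illustration only; this is not the mixOmics source code.

```r
# Soft-thresholding operator: shrink every entry towards zero by lambda
# and set entries whose absolute value is at most lambda to exactly zero
soft.threshold <- function(a, lambda) {
  sign(a) * pmax(abs(a) - lambda, 0)
}

# Hypothetical (non-sparse) loading vector for six variables
a <- c(0.9, -0.1, 0.4, -0.7, 0.05, 0.2)

# Choose lambda so that only the keep.x largest loadings (in absolute
# value) survive: use the (keep.x + 1)-th largest absolute value
keep.x <- 3
lambda <- sort(abs(a), decreasing = TRUE)[keep.x + 1]

a.sparse <- soft.threshold(a, lambda)
a.sparse <- a.sparse / sqrt(sum(a.sparse^2)) # rescale to unit norm

which(a.sparse != 0) # indices of the variables that remain selected
```

Variables whose loading is shrunk to zero no longer contribute to the component, which is how selection and model fitting can happen in a single step.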
Principles of (s)PLS-DA
While PLS was designed for regression and exploratory purposes, it can also be applied in classification contexts. Internally, the class vector y is converted to a dummy (indicator) matrix Y (i.e. 'one-hot encoded') of size \(N \times K\), where \(N\) is the number of samples and \(K\) is the number of classes. The standard PLS regression algorithm then operates on this new matrix. Classification uses the projection of the data onto the PLS components, which are defined by their corresponding loading vectors. Refer to the Distance Metrics page for more information on how classifications are actually made.
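The dummy-matrix conversion can be sketched in base R as follows. The class labels are made up for illustration (they are not the SRBCT data), and `model.matrix()` is used here only as one convenient way to build the indicator matrix; it is not necessarily how the package does it internally.

```r
# Hypothetical class labels for six samples
y <- factor(c("EWS", "BL", "NB", "RMS", "EWS", "BL"))

# Build the N x K indicator (dummy) matrix: one column per class,
# with a 1 where the sample belongs to that class and 0 elsewhere
Y.dummy <- model.matrix(~ y - 1)
colnames(Y.dummy) <- levels(y)

dim(Y.dummy)     # N rows (samples) by K columns (classes)
rowSums(Y.dummy) # each sample belongs to exactly one class
```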
PLS-DA performs comparably to equivalent classification methods, such as Linear (Fisher's) Discriminant Analysis [4], on large data sets and better on smaller ones. This is especially true in multiclass cases (more than two classes), as the PLS-DA model does not require the construction of multiple two-class submodels.
The classification performance of (s)PLS-DA models is evaluated with repeated cross-validation. Generally, 5 or 10 folds and 50-100 repeats are appropriate. As overfitting is always a risk in classification, the repeats are used to check that performance and feature selection are stable across different splits of the data rather than artefacts of a single split.
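The structure of repeated cross-validation can be sketched in base R as below. This is a generic illustration under stated assumptions, not the mixOmics procedure: mixOmics provides its own `perf()` function for this, which the sketch does not call. The `iris` data and the hypothetical `nearest.centroid` classifier are stand-ins for the SRBCT data and the (s)PLS-DA model, and the number of repeats is kept small so the example runs quickly.

```r
set.seed(42) # fold assignment is random, so fix the seed

X <- as.matrix(iris[, 1:4]) # stand-in data set shipped with R
y <- iris$Species

n.folds   <- 5
n.repeats <- 10 # kept small here; the text suggests 50-100

# Stand-in classifier: assign each test sample to the nearest class centroid
nearest.centroid <- function(X.train, y.train, X.test) {
  centroids <- apply(X.train, 2, function(col) tapply(col, y.train, mean))
  d <- apply(centroids, 1, function(ctr) colSums((t(X.test) - ctr)^2))
  factor(rownames(centroids)[apply(d, 1, which.min)], levels = levels(y.train))
}

error.rate <- numeric(n.repeats)
for (r in seq_len(n.repeats)) {
  # Stratified folds: shuffle the fold labels within each class
  fold <- ave(seq_along(y), y,
              FUN = function(i) sample(rep_len(seq_len(n.folds), length(i))))
  pred <- factor(rep(NA, length(y)), levels = levels(y))
  for (k in seq_len(n.folds)) {
    test <- fold == k
    # Fit on the other folds, predict the held-out fold
    pred[test] <- nearest.centroid(X[!test, , drop = FALSE], y[!test],
                                   X[test, , drop = FALSE])
  }
  error.rate[r] <- mean(pred != y)
}

mean(error.rate) # error rate averaged over the repeats
sd(error.rate)   # variability across repeats
```

Each repeat draws a new random partition into folds, so averaging over repeats reduces the dependence of the estimate on any single split.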
See Case Study: sPLS-DA SRBCT for more details and plotting options.
1. Pérez-Enciso, M. and Tenenhaus, M., 2003. Prediction of clinical outcome with microarray data: a partial least squares discriminant analysis (PLS-DA) approach. Human Genetics, 112(5-6), pp.581-592.
2. Nguyen, D.V. and Rocke, D.M., 2002. Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), pp.39-50.
3. Lê Cao, K.A., Boitard, S. and Besse, P., 2011. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics, 12(1), p.253.
4. Fisher, R., 1936. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), pp.179-188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x