Partial Least Squares regression (PLS)
Partial Least Squares (PLS) regression (Wold, 1966; Wold et al., 2001) is a multivariate methodology which relates (integrates) two data matrices X (e.g. transcriptomics) and Y (e.g. metabolites). PLS goes beyond traditional multiple regression by modelling the structure of both matrices. Unlike traditional multiple regression models, it is not limited to uncorrelated variables. One of the many advantages of PLS is that it can handle many noisy, collinear (correlated) and missing variables and can also simultaneously model several response variables Y. It is especically efficient when p + q >> n .
The flexibility of the PLS variants enables us to address different types of analytical questions, they include:
Regression mode: this is to model a causal relationship between the two data sets, i.e. PLS will predict Y from X
Canonical mode: similar to CCA, this mode is used to model a bi-directional relationship between the two data sets
Invariant mode: performs a redundancy analysis (the Y matrix is not deflated)
Classic mode: is the classical PLS as proposed in Tenenhaus (1998)
Sparse Partial Least Squares regression (sPLS)
Even though PLS is highly efficient in the high dimensional context, interpretability is needed to get more insight into the biological study. sPLS has been recently developed by our team to perform simultaneous variable selection in the two data sets (Lê Cao et al., 2008). Variable selection is achieved by introducing LASSO penalization on the pair of loading vectors. Both mode = regression and mode = canonical are available. In addition to the number of dimensions or components ncomp to choose, the user will have to specify the number of variables to selection on each dimension and for each data set KeepX, KeepY. In the complex case of highly dimensional omics data sets, the proposed statistical criteria may not be satisfactory enough to address the biological question. Sometimes it is best that the user chooses the number of variables to select based on intuition and the posterior biological interpretation of the results.
Usage in mixOmics
Tuning the number of components
The perf() function performs cross-validation, as specified by the arguments validation which performs cross-validation or leave-one-out to compute the MSEP, R2 and Q2. The rule of thumbs is that a PLS component should be included in the model if its value is greater than or equal to 0.0975 (Tenenhaus, 1998). Tuning criteria are mostly available for regression mode. However, it is advised to follow your intuition, backed up with the downstream biological interpretation of the results (gene ontology, functional analyses, etc).
PLS is implemented in mixOmics via the functions pls() and spls() as displayed below:
library(mixOmics) data(liver.toxicity) X <- liver.toxicity$gene Y <- liver.toxicity$clinic #PLS result <- pls(X, Y, ncomp = 3) # where ncomp is the number of dimensions/components to choose tune.pls <- perf(result, validation = 'loo', criterion = 'all', progressBar = FALSE) # SPLS ncomp = 10 result.spls <- spls(X, Y, ncomp = ncomp, keepX = c(rep(10, ncomp)), mode = 'regression') tune.spls <- perf(result.spls, validation = 'Mfold', folds = 10, criterion = 'all', progressBar = FALSE)
Wold, S., Esbensen, K., and Geladi, P. (1987). Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1), 37–52.
Wold, S., Sjöström, M., and Eriksson, L. (2001). Pls-regression: a basic tool of chemometrics. Chemometrics and intelligent laboratory systems, 58(2), 109–130.
Tenenhaus M. (1998) La régression PLS: théorie et pratique. Paris: Editions Technic.
Wold H. (1966) Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (editors). Multivariate Analysis. Academic Press, N.Y., pp 391-420.
Lê Cao K.-A., Rossouw D., Robert-Granié C. and Besse P. (2008) A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.