sPLS

Partial Least Squares regression (PLS)

Partial Least Squares (PLS) regression (Wold, 1966; Wold et al., 2001) is a multivariate methodology which relates (integrates) two data matrices X (e.g. transcriptomics) and Y (e.g. metabolites). PLS goes beyond traditional multiple regression by modelling the structure of both matrices. Unlike traditional multiple regression models, it is not limited to uncorrelated variables. One of the many advantages of PLS is that it can handle many noisy, collinear (correlated) and missing variables, and can also simultaneously model several response variables Y. It is especially efficient when p + q >> n .

The flexibility of the PLS variants enables us to address different types of analytical questions, they include:

PLS Modes

Different modes relate on how the Y matrix is deflated across the iterations of the algorithms – i.e. the different components.

  • Regression mode: the Y matrix is deflated with respect to the information extracted/modelled from the local regression on X. Here the goal is to predict Y from X (Y and X play an assymetric role). Consequently the latent variables computed to predict Y from X are different from those computed to preict X from Y.

  • Canonical mode: the Y matrix is deflated to the information extracted/modelled from the local regression on Y. Here X and Y play a symetric role and the goal is similar to a Canonical Correlation type of analysis.

  • Invariant mode: the Y matrix is not deflated and we perform a redundancy analysis.

  • Classic mode: is similar to a regression mode. It gives identical results for the variates and loadings associated to the X data set, but differences for the loadings vectors associated to the Y data set (different normalisations are used). Classic mode is the PLS2 model as defined by Tenenhaus (1998), Chap 9.

Note that in all cases the results are the same on the first component as deflation only starts after component 1.

Sparse Partial Least Squares regression (sPLS)

Even though PLS is highly efficient in the high dimensional context, interpretability is needed to get more insight into the biological study. sPLS has been recently developed by our team to perform simultaneous variable selection in the two data sets (Lê Cao et al., 2008). Variable selection is achieved by introducing LASSO penalization on the pair of loading vectors. Both mode = regression and mode = canonical are available. In addition to the number of dimensions or components ncomp to choose, the user has to specify the number of variables to select on each dimension and for each data set keepX, keepY. In the complex case of highly dimensional omics data sets, the proposed statistical criteria may not be satisfactory enough to address the biological question. Sometimes it might be best to choose the number of variables to select based on intuition and the posterior biological interpretation of the results (including our various graphical plots).

Usage in mixOmics

Tuning the number of components

The perf function performs cross-validation, as specified by the argument validation which performs cross-validation (CV) or leave-one-out CV to compute the MSEP, R2 and Q2. The rule of thumb is that a PLS component should be included in the model if its value is greater than or equal to 0.0975 (Tenenhaus, 1998). Tuning criteria are mostly available for regression mode. However, it is advised to follow your intuition, backed up with the downstream biological interpretation of the results (gene ontology, functional analyses, etc) as those criteria are actually quite tricky to tune, depending on the data to integrate.

PLS is implemented in mixOmics via the functions pls and spls as displayed below:

library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

#PLS
result <- pls(X, Y, ncomp = 3)  # where ncomp is the number of dimensions/components to choose
tune.pls <- perf(result, validation = 'loo', criterion = 'all', progressBar = FALSE)

# SPLS
ncomp = 10
result.spls <- spls(X, Y, ncomp = ncomp, keepX = c(rep(10, ncomp)), mode = 'regression')
tune.spls <- perf(result.spls, validation = 'Mfold', folds = 10,
                  criterion = 'all', progressBar = FALSE)

Case study

See Case Study: sPLS Liver Toxicity

References

PLS

Abdi H (2010). Partial least squares regression and projection on latent structure regression (PLS Regression). \emph{Wiley Interdisciplinary Reviews: Computational Statistics}, 2(1), 97-106.

Tenenhaus M. (1998) La régression PLS: théorie et pratique. Paris: Editions Technic.

Geladi P. and Kowalski B.R. (1986) Partial Least Squares Regression: A Tutorial. Analytica Chimica Acta 185, pp 1-17.

Wold H. (1966) Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (editors). Multivariate Analysis. Academic Press, N.Y., pp 391-420.

Wold, S., Sjöström, M., and Eriksson, L. (2001). Pls-regression: a basic tool of chemometrics. Chemometrics and intelligent laboratory systems, 58(2), 109–130.

sPLS

Lê Cao K.-A., Martin P.G.P., Robert-Granié C. and Besse P. (2009) Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 10(34).

Lê Cao K.-A., Rossouw D., Robert-Granié C. and Besse P. (2008) A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.

Shen H. and Huang J.Z. (2008) Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, pp 1015-1034.