DIABLO is our novel mixOmics framework for the integration of multiple data sets in a supervised analysis. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for ‘Omics studies.
The integration of multiple ‘omics data sets is topical to study biological systems. Such analysis aims to extract complementary information from several data sets measured on the same N individuals (or biological samples) but across multiple data sets generated using different (omics) platforms, to gain a better understanding of the interplay between the different levels of data that are measured. We call our generic framework ‘N-integration’ (as opposed to P-integration, see MINT).
The core DIABLO method builds on the Generalised Canonical Correlation Analysis , which contrary to what its name suggests, generalises PLS for multiple matching datasets, and the sparse sGCCA method . Starting from the R package RGCCA from Tenenhaus et al, we subtantially improved the method and code for different types of analyses, including unsupervised N-integration ( block.pls, block.spls) and supervised analyses ( block.plsda, block.splsda).
The aim of N-integration with our sparse methods is to identify correlated (or co-expressed) variables measured on heterogeneous data sets which also explain the categorical outcome of interest (supervised analysis). The multiple data integration task is not trivial, as the analysis can be strongly affected by the variation between manufacturers or ‘omics technological platforms despite being measured on the same biological samples. In addition, to avoid a ‘fishing omics expedition’, it is better to analyse each data separately first to understand well where the major sources of variation come from, and also to guide specifically the integration process.
In a preliminary study  we integrated transcriptomics and proteomics data in a kidney rejection study, while explaining the rejection status of the patients.
We further extended sGCCA for Discriminant Analysis, substantially improved the R code and developed innovative graphical outputs to interpret the DIABLO results, which we present in the next page (see tab).
Our manuscript is a collaborative work between the core team (Drs Florian Rohart, Kim-Anh Lê Cao), key contributors (Dr Amrit Singh, Benoît Gautier) as a result of a long term collaboration with University of British Columbia.
What is next?
The manuscript is still in progress as we are including additional examples.
We are also investigating an N-integration method based on kernels (see our example here with mixKernel), which has currently been developed for unsupervised analysis.
Feel free to contact us at mixomics [at] math.univ-toulouse.fr if you have any questions or comments.
- Tenenhaus A and Tenenhaus M. Regularized generalized canonical correlation analysis. Psychometrika, 76(2):257–284, 2011.
- Tenenhaus A, Philippe C, Guillemot V, Lê Cao, K. A., Grill J, and Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics, page kxu001, 2014.
- Günther OP, Shin H, Ng RT, McMaster WR, McManus BM, Keown PA, Tebbutt SJ, and Lê Cao, K. A.. Novel multivariate methods for integration of genomics and proteomics data: applications in a kidney transplant rejection study. Omics: a journal of integrative biology, 18(11):682–695, 2014.
- Singh, A., Gautier, B., Shannon, C.P., Vacher, M., Rohart, F., Tebutt, S.J. and Le Cao, K.A., 2016. DIABLO-an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv, p.067611.
- Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for ‘omics feature selection and multiple data integration.