MINT

We introduce MINT, our novel Multivariate INTegrative method (available as the mixMINT module in mixOmics). MINT combines and integrates independent studies measured on the same P predictors (e.g. genes); we call this framework P-integration.

Context

P-integration of homogeneous ’omics data combines studies generated under similar biological conditions but assayed at different geographical sites and/or at different times, which introduces systematic differences between studies. This systematic, unwanted variation may outweigh the interesting biological variation between individuals and molecules, decreasing statistical power and leading to spurious results and conclusions. When successful, however, a P-integrative analysis not only substantially increases sample size and statistical power, but also enables data sharing across research communities and re-use of existing data deposited in public databases, while identifying a reproducible biomarker signature.

Background

Previously, we extended sPLS-DA to combine independent transcriptomics studies and to identify a gene signature defining human Mesenchymal Stromal Cells (MSC) [1]. This is a topical question in stem cell biology, as MSCs are a poorly defined group of stromal cells despite their increasingly recognized clinical importance. In that first study we integrated 84 highly curated public gene expression data sets representing 125 MSC and 510 non-MSC samples, spanning 13 different microarray platforms that all measure gene expression levels. We used YuGene normalisation [2] combined with an improved sPLS-DA that relies on extensive subsampling to avoid overfitting and to ensure a robust and reproducible gene signature. The resulting platform-agnostic signature of 16 genes gave a classification accuracy of 97.8% on the training set and 93.5% on an external test set (187 MSC and 474 non-MSC). The MSC molecular signature predictor is available in the Stemformatics web resource, and the R package ‘bootsPLS’ is available on CRAN. The molecular signature has brought novel insights into the origin and function of MSC, and it can be considered a more accurate alternative to current immunophenotyping methods. Our signature predictor is currently being used and validated by many of our stem cell collaborators through Stemformatics. To our knowledge, no such comprehensive integrative study has previously been performed for biological classification problems.

MINT method

Building on this study, MINT extends the bootsPLS and multi-group PLS [3] approaches by including a study/group structure in the model, thus avoiding extensive subsampling. MINT is also a sparse multivariate method: it selects a subset of variables. Two frameworks are available:

  • a supervised framework where the aim is to classify samples and identify a set of discriminative markers leading to accurate class prediction in external test sets. In that case X is the combined data matrix (N x P) and Y is a combined vector indicating the class membership of each sample.
  • an unsupervised regression framework where the aim is to integrate the data matrix X with a quantitative matrix Y (think of a PLS-regression mode) and identify either correlated variables from both X and Y, or variables from X that best explain Y.

mixMINT performs P-integration: integration across M independent studies measured on the same P predictors.
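To make the data layout concrete, here is a minimal sketch in plain Python of how independent studies can be stacked for P-integration. It only illustrates the shapes involved: a combined matrix X (N x P) restricted to the predictors shared by all studies, a class vector Y of length N, and a vector recording which study each sample came from. The function name `stack_studies`, the input format, and the toy values are all hypothetical and are not part of mixOmics.

```python
from typing import Dict, List, Tuple

# Hypothetical toy format: study name -> sample id -> (class label, {gene: expression})
Study = Dict[str, Tuple[str, Dict[str, float]]]


def stack_studies(studies: Dict[str, Study]):
    """Stack M studies into X (N x P), Y (length N) and a study vector (length N)."""
    # Keep only the predictors measured in every sample of every study (the shared P).
    gene_sets = [set(expr) for study in studies.values() for _, expr in study.values()]
    genes: List[str] = sorted(set.intersection(*gene_sets))

    X: List[List[float]] = []
    Y: List[str] = []
    study_ids: List[str] = []
    for study_name, study in studies.items():
        for _, (label, expr) in study.items():
            X.append([expr[g] for g in genes])  # one row per sample, columns in gene order
            Y.append(label)                     # class membership of the sample
            study_ids.append(study_name)        # study membership of the sample
    return genes, X, Y, study_ids


# Made-up values, for illustration only.
toy = {
    "study_A": {
        "a1": ("MSC", {"gene1": 5.0, "gene2": 1.0, "gene3": 2.0}),
        "a2": ("non-MSC", {"gene1": 1.5, "gene2": 3.0, "gene3": 0.5}),
    },
    "study_B": {
        "b1": ("MSC", {"gene1": 7.0, "gene2": 2.0}),  # gene3 not measured here
    },
}
genes, X, Y, study_ids = stack_studies(toy)
print(len(X), "samples x", len(genes), "shared predictors")
```

In mixOmics itself, objects of this kind would correspond to the `X`, `Y` and `study` arguments of functions such as `mint.splsda`; the sketch above only mirrors that layout and does not perform any modelling.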

The next page illustrates the use of MINT when combining transcriptomics data sets.

Our manuscript [4] is a collaborative work between the core team (Drs Florian Rohart, Kim-Anh Lê Cao) and French researchers (Drs Aida Eslami, Stephanie Bougeard). Feel free to contact us at mixomics [at] math.univ-toulouse.fr if you have any questions.

References

[1] Rohart, F., Mason, E. A., Matigian, N., Mosbergen, R., Korn, O., Chen, T., Butcher, S., Patel, J., Atkinson, K., Khosrotehrani, K., Fisk, N. M., Lê Cao, K-A., and Wells, C. A. (2016). A molecular classification of human mesenchymal stromal cells. PeerJ, 4, e1845.

[2] Lê Cao, K-A., Rohart, F., McHugh, L., Korn, O., and Wells, C. A. (2014). YuGene: A simple approach to scale gene expression data derived from different platforms for integrated analyses. Genomics, 103, 239–251.

[3] Eslami, A., Qannari, E. M., Kohler, A., and Bougeard, S. (2014). Algorithms for multi-group PLS. J. Chemometrics, 28(3), 192–201.

[4] Rohart, F., Matigian, N., Eslami, A., Bougeard, S., and Lê Cao, K-A. (2017). MINT: A multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. BMC Bioinformatics, 18, 128.