mixOmics is a collaborative project between Australia (Melbourne), France (Toulouse), and Canada (Vancouver). The core team includes Kim-Anh Lê Cao (University of Melbourne), Florian Rohart (Brisbane) and Sébastien Déjean (Toulouse). We also have key contributors, past (Benoît Gautier, François Bartolo) and present (Al Abadi, University of Melbourne), and several collaborators, including Amrit Singh (University of British Columbia) and Olivier Chapleur (INRA, Paris). It could be you too if you wish to be involved: we host many visitors with computational, statistical and biological backgrounds!
Why multivariate methods?
It is generally accepted that a single ‘omics analysis does not provide enough information for a deep understanding of a biological system, and that combining multiple ‘omics analyses gives a more holistic view. Our mixOmics R package offers a whole range of multivariate methods that we developed and validated on many biological studies to gain more insight from ‘omics data.
mixOmics offers a wide range of multivariate methods for the exploration and integration of biological datasets, with a particular focus on variable selection.
Multivariate methods are well suited to large ‘omics data sets where the number of variables (e.g. genes, proteins, metabolites) is much larger than the number of samples (patients, cells, mice). They have the appealing property of reducing the dimension of the data by using instrumental variables (‘components’), which are defined as combinations of all variables. Those components are then used to produce graphical outputs that give a better understanding of the relationships and correlation structure between the different data sets that are integrated. We have developed several sparse multivariate models to identify the key variables that are highly correlated and/or explain the biological outcome of interest (e.g. disease status). The identified variables are then more amenable to statistical inference and can be used to posit novel biological hypotheses to be further validated in the laboratory.
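To make the idea of a ‘component’ concrete, here is a minimal sketch in plain Python (not the mixOmics R code). It computes one component as a weighted combination of all variables by power iteration, then repeats the computation with soft-thresholding so that small weights become exactly zero. The helper names, the toy matrix and the threshold `lam` are assumptions made for illustration. The thresholding step is a simplified variant of the sparse approach, not the algorithm implemented in the package.

```python
import math

def center_columns(X):
    """Subtract each column mean so the component describes variation only."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def normalise(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def soft_threshold(v, lam):
    """Shrink weights towards zero by lam; weights below lam become exactly 0."""
    return [math.copysign(abs(x) - lam, x) if abs(x) > lam else 0.0 for x in v]

def first_component(X, lam=0.0, n_iter=200):
    """Return (weights, scores) for one component of a centred matrix X.

    lam = 0 gives the ordinary first principal component via power iteration.
    lam > 0 soft-thresholds the unit-length weights at every step, which
    drops variables with small weights (illustrative sparse variant).
    """
    n, p = len(X), len(X[0])
    w = normalise([1.0] * p)
    for _ in range(n_iter):
        # Scores: one value per sample, a weighted combination of all variables.
        t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
        # Weights: how strongly each variable follows the scores.
        w = normalise([sum(X[i][j] * t[i] for i in range(n)) for j in range(p)])
        w = normalise(soft_threshold(w, lam))
    t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(n)]
    return w, t

# Hypothetical toy data: 5 samples x 4 variables. Variables 0 and 1 vary
# together strongly; variables 2 and 3 are small noise.
toy = center_columns([
    [2.0, 1.9, 0.1, -0.1],
    [1.0, 1.1, -0.1, 0.1],
    [0.0, 0.1, 0.1, 0.0],
    [-1.0, -1.0, 0.0, -0.1],
    [-2.0, -2.1, -0.1, 0.1],
])

dense_w, dense_t = first_component(toy)             # every variable gets a weight
sparse_w, sparse_t = first_component(toy, lam=0.1)  # noise variables are dropped
selected = [j for j, w in enumerate(sparse_w) if w != 0.0]
print("dense weights: ", [round(w, 3) for w in dense_w])
print("sparse weights:", [round(w, 3) for w in sparse_w])
print("selected variables:", selected)
```

The dense component gives every variable a non-zero weight. The sparse one keeps only the variables that drive the main pattern, which is the sense in which sparse models perform variable selection.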
Which type of data?
The data analysed with mixOmics may be ‘omics data from high-throughput technologies (transcriptomics, metabolomics, proteomics, microbiome/metagenomics...) but may also come from beyond the realm of ‘omics (e.g. spectral imaging). We are currently developing new methods to integrate time-course or longitudinal omics data, and we are investigating ways to integrate genotype data.
The methods implemented in mixOmics can also handle missing values without having to delete entire rows with missing data.
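One common way for component-based methods to cope with missing values is a NIPALS-type iteration, in which every sum simply skips the missing cells. The sketch below is plain Python written for illustration; the text above does not say which algorithm each mixOmics method uses, so treat this as an assumption about the general technique rather than a description of the package internals. The function name and toy data are hypothetical, and column centring is omitted to keep the example short.

```python
import math

def nipals_first_component(X, n_iter=500, tol=1e-12):
    """First component of X (list of rows, None = missing value).

    All sums run over observed cells only, so a sample with a missing
    value still contributes through its other variables.
    """
    n, p = len(X), len(X[0])
    # Initial scores: first column, with 0.0 wherever it is missing.
    t = [row[0] if row[0] is not None else 0.0 for row in X]
    w = [0.0] * p
    for _ in range(n_iter):
        # Weights: regress each column on the scores, observed rows only.
        w = []
        for j in range(p):
            rows = [i for i in range(n) if X[i][j] is not None]
            den = sum(t[i] ** 2 for i in rows)
            w.append(sum(X[i][j] * t[i] for i in rows) / den if den > 0 else 0.0)
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        # Scores: regress each row on the weights, observed columns only.
        t_new = []
        for i in range(n):
            cols = [j for j in range(p) if X[i][j] is not None]
            den = sum(w[j] ** 2 for j in cols)
            t_new.append(sum(X[i][j] * w[j] for j in cols) / den if den > 0 else 0.0)
        change = sum((a - b) ** 2 for a, b in zip(t, t_new))
        t = t_new
        if change < tol:
            break
    return w, t

# Hypothetical toy data with exact rank-one structure, x[i][j] = a[i] * b[j],
# where a = (1, 2, 3, 4) and b = (1, 0.5, 2). One cell is missing.
data = [
    [1.0, 0.5, 2.0],
    [2.0, 1.0, None],  # the hidden value would be 2 * 2 = 4
    [3.0, 1.5, 6.0],
    [4.0, 2.0, 8.0],
]

weights, scores = nipals_first_component(data)
estimate = scores[1] * weights[2]  # rank-one reconstruction of the missing cell
print("estimated missing value:", round(estimate, 6))
```

No row is deleted: the second sample keeps its two observed values, and the fitted component provides an estimate for the missing one.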
New to mixOmics?
Have a look at this webinar and this bookdown document, which present our key methods step by step.
Any questions or feedback? Contact us here.
mixOmics is under active development as we focus on novel multivariate methods to address pressing needs for omics data integration. Subscribe to our mailing list to keep up with our latest version (the development version can be pulled from GitHub), or have a look at the NEWS posts.
About this website
This website gives a full tutorial introduction to the main mixOmics features and illustrates full multivariate analyses on some case studies. Click on the different tabs to see all options available.
Workshops
We also run regular 2- and 3-day workshops in Australia, in Europe and elsewhere. Have a look at our list of upcoming workshops. We usually advertise 3-4 months in advance with an Expression of Interest survey to fill in. Do not hesitate to contact us at mixomics[at]math.univ-toulouse.fr (email for workshop enquiries only!) to run dedicated workshops for a specific data type in your country.
The mixOmics framework today
The toolkit currently includes 19 multivariate methodologies, depicted below according to the data to integrate and the biological question (e.g. exploration, discriminant analysis, data integration for 2 or more data sets).
The R package and key references
The mixOmics R package is organised into three main parts:
1. Statistical methodologies to analyse high-throughput data
- (s)PCA: (sparse) Principal Component Analysis as proposed by Shen and Huang 2008.
- (s)IPCA: (sparse) Independent Principal Component Analysis
- (r)CCA: (regularized) Canonical Correlation Analysis as implemented in González et al. 2008.
- (s)PLS: (sparse) Partial Least Squares (regression or canonical deflations)
- (s)PLS-DA: (sparse) Partial Least Squares Discriminant Analysis
- Multilevel decomposition for repeated measurements
- mixMC for 16S multivariate analysis (see article)
- MINT for vertical multiple integration (see article)
- DIABLO for horizontal multiple integration, based on this article, but with substantial improvements, see article.
- The integrative and supervised methods in mixOmics are summarised and presented in our mixOmics article.
2. Graphical outputs to display the results and improve interpretation
- 2D and 3D sample plots, with confidence ellipses
- Relevance Networks (see article)
- Clustered Image Maps (heatmaps, see article)
- Correlation circle plots (see article)
- Arrow plots
- Circos plots for DIABLO analyses (see details here)
- Loading plots (first used here)
3. Example data sets
Each type of biological question can be answered with a specific method. This is why we provide in the package a whole range of case studies to illustrate each method.
- breast.tumor (gene expression data, with missing data)
- linnerud: very small data set for illustration of key concepts
- liver.toxicity (gene expression and clinical data, for sPLS)
- multidrug (ABC transporters and compounds, for rCCA)
- nutrimouse (gene expression and fatty acids data, for rCCA)
- srbct (gene expression data, for sPLS-DA)
- yeast (metabolites data)
- vac18 and vac18.simulated for multilevel analyses
- diverse.16S and Koren.16S for mixMC 16S analyses (similar to that paper)
- stemcells for MINT vertical multiple integration analyses
- breast.TCGA for DIABLO horizontal multiple integration analyses