Selecting your Method

This page will help you select which mixOmics method is most appropriate for your data and biological question. Use the schematics to get an overview of the methods available, or scroll down to read what each method can be used for.

mixOmics is appropriate for any omics data (e.g. transcriptomics, metabolomics, proteomics, microbiome/metagenomics …) but also non-omics data (e.g. spectral imaging). Input data can either be quantitative (e.g. transcriptomics data from different tumour samples) or qualitative (e.g. the classification of the tumour samples into groups).

Overview of how many quantitative and qualitative data blocks (represented by blue and orange rectangles respectively) are required for each mixOmics method.

Decision tree to aid method selection based on the type of analysis. First, identify how many datasets you have and how multiple datasets are related: datasets with shared samples are integrated by N-integration and datasets with shared variables by P-integration. Next, identify whether your analysis is a classification or a regression problem. Classification involves qualitative data (e.g. tumour classifications), whereas regression involves only quantitative data (e.g. transcriptomics).
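The decision tree can be read as a short chain of questions. The sketch below encodes one possible reading of it in plain Python. The function name `suggest_method` and its arguments are hypothetical and are not part of the mixOmics API; the returned labels simply reuse the method names from this page.

```python
def suggest_method(n_datasets, categorical_outcome=False, shared="samples"):
    """Suggest a method family from the decision tree (hypothetical helper).

    n_datasets          -- number of quantitative data blocks
    categorical_outcome -- True for a classification problem
    shared              -- "samples" or "variables"; only consulted when
                           there is more than one dataset
    """
    if n_datasets < 1:
        raise ValueError("at least one dataset is required")
    if shared not in ("samples", "variables"):
        raise ValueError("shared must be 'samples' or 'variables'")

    # One dataset: unsupervised exploration or single-block classification.
    if n_datasets == 1:
        return "(s)PLS-DA" if categorical_outcome else "(s)PCA"

    # Several datasets measured on the same variables: P-integration.
    if shared == "variables":
        if categorical_outcome:
            return "MINT (multigroup sPLS-DA)"
        return "MINT (multigroup sPLS)"

    # Two quantitative datasets measured on the same samples.
    if n_datasets == 2 and not categorical_outcome:
        return "(s)PLS or (r)CCA"

    # Several datasets measured on the same samples: N-integration.
    if categorical_outcome:
        return "DIABLO (multiblock sPLS-DA)"
    return "multiblock sPLS"


print(suggest_method(1))                             # (s)PCA
print(suggest_method(3, categorical_outcome=True))   # DIABLO (multiblock sPLS-DA)
```

The helper only narrows the choice to a family of methods; the sections below describe which questions each family answers.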

A PCA type of question

(one dataset, unsupervised)

  • What are the major trends or patterns in my data?
  • Do the samples cluster according to the biological conditions of interest?
  • Which variables contribute the most to explaining the variance in the data?

Variants such as sparse PCA (sPCA) allow for the identification of key variables that contribute to defining the principal components. Independent Principal Component Analysis (IPCA) uses Independent Component Analysis (ICA) as a denoising process prior to principal component generation, in order to maximise statistical independence between these components.
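To make the idea of variables "contributing" to a component concrete, here is a minimal sketch of how the loading vector of the first principal component can be obtained by power iteration on the sample covariance matrix. It is a teaching example in plain Python, not the algorithm mixOmics uses, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, n_iter=200):
    """Return the loading vector of the first principal component.

    rows is a list of samples, each a list of p numeric variables.
    Assumes the starting vector is not orthogonal to the leading
    direction of variance (otherwise power iteration cannot find it).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]

    # Sample covariance matrix (p x p).
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]

    v = [1.0 / math.sqrt(p)] * p  # unit-length starting vector
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no variance along the current direction
            break
        v = [x / norm for x in w]
    return v


# Toy data: the second variable is exactly twice the first.
loadings = first_principal_component([[0.0, 0.0], [1.0, 2.0],
                                      [2.0, 4.0], [3.0, 6.0]])
print(loadings)
```

The variable with the largest absolute loading contributes most to the component. In this toy data the second variable carries twice the weight of the first.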

(s)PCA Methods Page

(s)PCA Multidrug Case Study

A PLS type of question

(two datasets, unsupervised or supervised)

  • Does the information from both datasets agree and reflect any biological condition of interest?
  • If I consider Y as response data, can I model Y given the predictor variables X?
  • What are the subsets of variables that are highly correlated and explain the major sources of variation across the data sets?

PLS maximises the covariance between data sets via latent components, which reduce the dimensions of the data. In sparse PLS (sPLS), lasso penalisation is applied on the loading vectors to identify the key variables that covary. There are two modes:

  • sPLS Regression: One dataset can be explained by another.
  • sPLS Canonical: Similar to Canonical Correlation Analysis (CCA), where both data sets are considered symmetrically. The difference is that PLS maximises the covariance between the datasets whereas CCA maximises the correlation.
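The effect of lasso penalisation on a loading vector can be illustrated with the soft-thresholding operator, which shrinks every loading towards zero and sets the small ones to exactly zero. The sketch below assumes a raw threshold value called `penalty`; how mixOmics parametrises the penalty is not described on this page, and the function names are hypothetical.

```python
def soft_threshold(loadings, penalty):
    """Shrink each loading towards zero by `penalty`; small ones become 0."""
    shrunk = []
    for a in loadings:
        magnitude = abs(a) - penalty
        if magnitude <= 0:
            shrunk.append(0.0)          # variable is dropped
        elif a > 0:
            shrunk.append(magnitude)    # positive loading, reduced
        else:
            shrunk.append(-magnitude)   # negative loading, reduced
    return shrunk


def selected_variables(loadings, names, penalty):
    """Names of the variables whose loading survives the thresholding."""
    shrunk = soft_threshold(loadings, penalty)
    return [name for name, a in zip(names, shrunk) if a != 0.0]


# Made-up loadings for four variables.
print(selected_variables([0.8, -0.5, 0.1, -0.05],
                         ["g1", "g2", "g3", "g4"], penalty=0.2))
# ['g1', 'g2']
```

A larger penalty leaves fewer non-zero loadings, and therefore a smaller set of selected variables.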

(s)PLS Methods Page

(s)PLS Liver Toxicity Case Study

A CCA type of question

(two datasets, unsupervised)

  • Does the information from both data sets agree and reflect any biological condition of interest?
  • What is the overall correlation between them?

CCA, and its variant regularised CCA (rCCA), achieves dimension reduction in each dataset whilst maximising similar information between the two datasets measured on the same samples. The canonical correlations inform us of the agreement between the two data sets that are projected into a smaller space spanned by the canonical variates.
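CCA works with correlation, whereas PLS works with covariance. For a single pair of variables the practical difference is that covariance depends on the measurement scale and correlation does not. The sketch below shows this with two made-up vectors in plain Python; it illustrates the two quantities only, not the CCA algorithm itself.

```python
import math


def covariance(x, y):
    """Sample covariance of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def correlation(x, y):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
y_rescaled = [10.0 * value for value in y]  # same variable, different unit

# Covariance grows tenfold with the change of unit; correlation is unchanged.
print(covariance(x, y), covariance(x, y_rescaled))
print(correlation(x, y), correlation(x, y_rescaled))
```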

(r)CCA Methods Page

(r)CCA Nutrimouse Case Study

A PLS-DA type of question

(one dataset, classification)

  • Can I discriminate samples based on their outcome category?
  • Which variables discriminate the different outcomes?
  • Can they constitute a molecular signature that predicts the class of external samples?

PLS-DA is the special case of PLS where the Y dataframe is a single, categorical variable (y). PLS-DA is used for classification by fitting a predictive model which discriminates sample groups. The variant sparse PLS-DA (sPLS-DA) includes lasso penalisation on the loading vectors to identify a subset of key variables.
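A PLS model needs a numeric response, so a categorical outcome has to be recoded before it can be used. One common way to do this, assumed in this sketch, is an indicator (dummy) matrix with one column per class. The function name `dummy_code` and the class labels are made up for illustration.

```python
def dummy_code(labels):
    """Recode class labels as an indicator matrix.

    Returns (levels, matrix): levels is the sorted list of distinct
    classes, and matrix has one row per sample and one column per level,
    with a 1 in the column of the sample's class and 0 elsewhere.
    """
    levels = sorted(set(labels))
    matrix = [[1 if label == level else 0 for level in levels]
              for label in labels]
    return levels, matrix


levels, Y = dummy_code(["typeB", "typeA", "typeB", "typeC"])
print(levels)  # ['typeA', 'typeB', 'typeC']
for row in Y:
    print(row)
```

Each row contains exactly one 1, so the matrix carries the same information as the original label vector y.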

(s)PLS-DA Methods Page

(s)PLS-DA SRBCT Case Study

An N-integration type of question

(several data sets, supervised or unsupervised)

  • Does the information from all data sets agree and reflect any biological condition of interest?
  • Can I discriminate samples across several data sets based on their outcome category?
  • Which variables across the different omics data sets discriminate the different outcomes?
  • Can they constitute a multi-omics signature that predicts the class of external samples?

The N-integration framework integrates several datasets measured on the same samples. The multiblock sPLS method performs a PLS analysis on more than two datasets. If a supervised framework is desired, multiblock sPLS-DA (referred to as DIABLO) generates a predictive model for a categorical variable based on predictors from several datasets. Non-sparse variants of these methods are also available within the mixOmics package.
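Because N-integration requires the same samples in every dataset, a typical preparation step is to keep only the samples present in all blocks and to put them in the same order. The sketch below shows that step in plain Python. The data layout (a dictionary of blocks keyed by sample ID) and the name `align_blocks` are assumptions made for the example.

```python
def align_blocks(blocks):
    """Keep the samples shared by all blocks, in one common order.

    blocks maps a block name to a dict of {sample_id: list of values}.
    Returns (sample_ids, aligned), where aligned maps each block name
    to a list of rows ordered like sample_ids.
    """
    if not blocks:
        raise ValueError("at least one block is required")
    shared = set.intersection(*(set(block) for block in blocks.values()))
    sample_ids = sorted(shared)
    aligned = {name: [block[s] for s in sample_ids]
               for name, block in blocks.items()}
    return sample_ids, aligned


# Made-up blocks: sample s1 lacks protein data, s4 lacks mRNA data.
blocks = {
    "mrna": {"s1": [1.0, 2.0], "s2": [3.0, 4.0], "s3": [5.0, 6.0]},
    "protein": {"s2": [0.1], "s3": [0.2], "s4": [0.3]},
}
sample_ids, aligned = align_blocks(blocks)
print(sample_ids)  # ['s2', 's3']
```

After alignment, row i of every block refers to the same sample, which is the arrangement the framework described above relies on.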

DIABLO Methods Page

DIABLO TCGA Case Study

A P-integration type of question

(several studies of the same omic type, supervised or unsupervised)

  • Can I combine the data sets while accounting for the variation between studies?
  • Can I discriminate the samples based on their outcome category?
  • Which variables are discriminative across all studies?
  • Can they constitute a signature that predicts the class of external samples?

The P-integration framework (referred to as MINT, Multivariate INTegration) integrates several datasets measured on the same types of variables. For example, if it is genomic data then each dataframe would have the same set of genetic markers represented as variables. The supervised framework (multigroup sPLS-DA) aims to classify samples and generate a set of variables which leads to the best prediction on an external test set. The unsupervised framework (multigroup sPLS) identifies highly correlated latent components from the multiple datasets. It can also be used in a regression analysis, similar to the sPLS regression mode.
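Since P-integration requires the same variables in every study, a typical preparation step is to restrict each study to the shared variables, stack the samples, and record which study each sample came from. The sketch below shows that step in plain Python. The data layout and the names `stack_studies`, `X` and `study` are assumptions made for the example.

```python
def stack_studies(studies):
    """Stack samples from several studies on their shared variables.

    studies maps a study name to a list of samples, each sample being
    a dict of {variable_name: value}.
    Returns (variables, X, study): the shared variable names, one row
    of values per sample, and the study label of each row.
    """
    shared = None
    for samples in studies.values():
        for sample in samples:
            keys = set(sample)
            shared = keys if shared is None else shared & keys
    variables = sorted(shared or [])

    X, study = [], []
    for name, samples in studies.items():
        for sample in samples:
            X.append([sample[v] for v in variables])
            study.append(name)  # remembers the origin of each row
    return variables, X, study


# Made-up studies: geneC was only measured in lab1, so it is dropped.
studies = {
    "lab1": [{"geneA": 1.0, "geneB": 2.0, "geneC": 3.0}],
    "lab2": [{"geneA": 4.0, "geneB": 5.0}, {"geneA": 6.0, "geneB": 7.0}],
}
variables, X, study = stack_studies(studies)
print(variables)  # ['geneA', 'geneB']
print(study)      # ['lab1', 'lab2', 'lab2']
```

Keeping the study label alongside the stacked data is what allows the variation between studies to be accounted for later.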

MINT Methods Page

MINT Stem Cells Case Study