library(mixOmics) # import the mixOmics library

data(breast.TCGA) # load the TCGA data

# use the miRNA, mRNA and protein expression levels as predictive datasets
# note that each dataset is measured across the same individuals (samples)
X1 <- breast.TCGA$data.train$mirna
X2 <- breast.TCGA$data.train$mrna
X3 <- breast.TCGA$data.train$protein
X <- list(mirna = X1, mrna = X2, protein = X3)

Y <- breast.TCGA$data.train$subtype # use the subtype as the outcome variable

result.diablo.tcga <- block.plsda(X, Y) # run the method

plotIndiv(result.diablo.tcga) # plot the samples
plotVar(result.diablo.tcga) # plot the variables
?block.plsda can be run to determine all default arguments of this function:
- Number of components (ncomp = 2): The first two PLS components are calculated,
- Design matrix (design = "full"): The strength of all relationships between dataframes is maximised (= 1), i.e. a “fully connected” design,
- PLS mode (mode = regression): A PLS regression mode is performed,
- Scaling of the data (scale = TRUE): Each block is standardised to zero means and unit variances.
# set the number of features to use for the X datasets
list.keepX = list(mirna = c(16, 17), mrna = c(18, 5), protein = c(5, 5))

# run the method
result.sparse.diablo.tcga <- block.splsda(X, Y, keepX = list.keepX)

# plot the contributions of each feature to each dimension
plotLoadings(result.sparse.diablo.tcga, ncomp = 1)

plotIndiv(result.sparse.diablo.tcga) # plot the samples
plotVar(result.sparse.diablo.tcga) # plot the variables
?block.splsda can be run to determine all default arguments of this function:
- Same defaults as above for block.plsda,
- Features to retain (keepX): If unspecified, all features of the original dataframes will be used.
N-Integration Discriminant Analysis with DIABLO
DIABLO is a novel mixOmics framework for the integration of multiple data sets in a supervised analysis. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies. It can also be referred to as Multiblock (s)PLS-DA. Figure 1 depicts how this method fits into the pipeline of an omics study.
FIGURE 1: Omics Study Pipeline.
DIABLO is the supervised approach within the mixOmics N-integrative framework. It allows users to integrate multiple datasets while explaining their relationship with a categorical outcome variable. The quantity and type of data that this method is meant to handle can be seen in Figure 2.
FIGURE 2: N-Integration Data Framework.
The DIABLO Method
In DIABLO, latent components (linear combinations of variables) are constructed such that the sum of covariances between all pairs of datasets is maximised. Each pairwise covariance is weighted as indicated by the design matrix. The response variable is transformed into a dummy variable (i.e. 'one hot encoded') internally within the function. The regression sGCCA framework from the RGCCA package is used to deflate each of the datasets.
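The two ingredients described above can be illustrated with a minimal base R sketch. It is not the mixOmics implementation: the outcome `Y`, the components in `comp` and the `design` matrix are made-up values, and the code only evaluates the weighted sum of pairwise covariances for fixed components rather than optimising it.

```r
# Hypothetical outcome with three classes (illustrative values only)
Y <- factor(c("Basal", "Her2", "LumA", "Basal", "LumA"))

# One-hot (dummy) encoding: one indicator column per class
Y_dummy <- model.matrix(~ Y - 1)
colnames(Y_dummy) <- levels(Y)

# Hypothetical latent components, one per dataset (random numbers here)
set.seed(1)
comp <- list(mirna = rnorm(5), mrna = rnorm(5), protein = rnorm(5))

# Fully connected design: weight 1 between every pair of datasets
design <- matrix(1, nrow = 3, ncol = 3,
                 dimnames = list(names(comp), names(comp)))
diag(design) <- 0

# Weighted sum of covariances over all pairs of datasets,
# the quantity that the method seeks to maximise
cov_mat <- cov(do.call(cbind, comp))
objective <- sum(design[upper.tri(design)] * cov_mat[upper.tri(cov_mat)])
```

Lowering an entry of `design` reduces the contribution of that pair of datasets to `objective`.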
When it comes to predicting novel samples with a DIABLO model, one prediction per dataset is generated and these are combined in a majority vote. This can also be a weighted vote where the weights are determined by the correlation between the latent components of that dataset with the outcome.
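The voting step can be sketched in base R as follows. The per-dataset predictions in `block_pred` and the weights in `block_weight` are hypothetical values chosen for illustration, and returning `NA` on a tie is an assumption of this sketch rather than a statement about the package.

```r
# Hypothetical per-dataset class predictions for four new samples
# (rows = samples, columns = datasets)
block_pred <- data.frame(
  mirna   = c("Basal", "Her2", "LumA", "Basal"),
  mrna    = c("Basal", "LumA", "LumA", "Her2"),
  protein = c("Her2",  "LumA", "LumA", "LumA"),
  stringsAsFactors = FALSE
)

# Hypothetical weights, standing in for the correlation between each
# dataset's latent components and the outcome
block_weight <- c(mirna = 0.6, mrna = 0.9, protein = 0.8)
equal_weight <- c(mirna = 1, mrna = 1, protein = 1)

classes <- c("Basal", "Her2", "LumA")

# Sum the weights of the datasets voting for each class and return the
# class with the largest total; ties give NA (assumption of this sketch)
vote <- function(pred_row, weights) {
  score <- sapply(classes, function(cl) sum(weights[pred_row == cl]))
  winner <- names(score)[score == max(score)]
  if (length(winner) == 1) winner else NA_character_
}

majority_vote <- apply(block_pred, 1, vote, weights = equal_weight)
weighted_vote <- apply(block_pred, 1, vote, weights = block_weight)
```

For the fourth sample the three datasets disagree, so the majority vote is tied, while the weighted vote follows the dataset with the largest weight.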
The balance between discrimination and integration
A compromise needs to be achieved between maximising the correlation between datasets (X~1~, …, X~Q~) and maximising the discriminative ability of the resulting model on the outcome y. This translates to the values used in the design matrix. A value between 0.5 and 1 will prioritise the correlation between two given datasets, while a value lower than this range will prioritise the predictive ability of the model (Singh et al., 2019). How this matrix is constructed depends on both prior knowledge (e.g. “I expect that mRNA data and miRNA data will be highly correlated”) and data-driven conclusions (generated by prior exploration of the data using non-N-integrative methods such as (s)PLS). Construction of the design matrix is explained further in the N-Integrative Methods page.
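As a minimal sketch, a design matrix for the three TCGA datasets could be built in base R as shown here. The values 0.1 and 0.8 are illustrative assumptions, not recommendations: 0.1 favours discrimination for most pairs, while 0.8 encodes a prior expectation that the miRNA and mRNA datasets are highly correlated.

```r
block_names <- c("mirna", "mrna", "protein")

# Start with a small weight between every pair of datasets, which
# prioritises the predictive ability of the model
design <- matrix(0.1, nrow = 3, ncol = 3,
                 dimnames = list(block_names, block_names))

# A dataset is not linked to itself
diag(design) <- 0

# Prior knowledge: miRNA and mRNA are expected to be highly correlated,
# so give that pair a value in the 0.5 to 1 range (kept symmetric)
design["mirna", "mrna"] <- 0.8
design["mrna", "mirna"] <- 0.8
```

The resulting matrix would then be supplied through the `design` argument, for example `block.splsda(X, Y, keepX = list.keepX, design = design)`.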
Our manuscript is a collaborative work between the core team (Dr Florian Rohart, Dr Kim-Anh Lê Cao), and key contributors (Dr Amrit Singh, Benoît Gautier) as a result of a long-term collaboration with the University of British Columbia.
We are also investigating an *N*-integration method based on kernels (see our example with mixKernel), which has so far been developed for unsupervised analysis.
The mixOmics DIABLO methodology has been applied in published research. A few examples are listed below:
- Lee, A.H., Shannon, C.P., Amenyogbe, N. et al. Dynamic molecular changes during the first week of human life follow a robust developmental trajectory. Nat Commun 10, 1092 (2019)
- Gavin PG, Mullaney JA, Loo D, Cao KL, Gottlieb PA, Hill MM, Zipris D, Hamilton-Williams EE. Intestinal Metaproteomics Reveals Host-Microbiota Interactions in Subjects at Risk for Type 1 Diabetes. Diabetes Care. 2018 Oct;41(10):2178-2186