DIABLO

Quick Start

library(mixOmics) # load the mixOmics package
data(breast.TCGA) # load the breast cancer TCGA example data

# use the mirna, mrna and protein expression levels as predictive datasets
# note that each dataset is measured across the same individuals (samples)
X1 <- breast.TCGA$data.train$mirna
X2 <- breast.TCGA$data.train$mrna  
X3 <- breast.TCGA$data.train$protein
X <- list(mirna = X1, mrna = X2, protein = X3)

Y <- breast.TCGA$data.train$subtype # use the subtype as the outcome variable

Multiblock PLS-DA

result.diablo.tcga <- block.plsda(X, Y) # run the method
plotIndiv(result.diablo.tcga) # plot the samples
plotVar(result.diablo.tcga) # plot the variables

?block.plsda can be run to determine all default arguments of this function:

  • Number of components (ncomp = 2): The first two PLS components are calculated,
  • Design matrix (design = "full"): The strength of all relationships between dataframes is maximised (= 1) – a “fully connected” design,
  • PLS mode (mode = "regression"): A PLS regression mode is performed,
  • Scaling of the data (scale = TRUE): Each variable in every block is standardised to zero mean and unit variance.
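
To make two of the defaults listed above concrete, here is a minimal base R sketch (it does not call mixOmics). It assumes that a "full" design corresponds to a matrix of ones with a zero diagonal and that scale = TRUE behaves like base R's scale() applied to the columns of a block. The small matrix `block` and its feature names are hypothetical.

```r
set.seed(1)

# Hypothetical block: 6 samples (rows) by 3 features (columns)
block <- matrix(rnorm(6 * 3, mean = 5, sd = 2), nrow = 6,
                dimnames = list(NULL, c("f1", "f2", "f3")))

# Standardise each column to zero mean and unit variance
block.scaled <- scale(block, center = TRUE, scale = TRUE)
col.means <- colMeans(block.scaled)
col.sds <- apply(block.scaled, 2, sd)

# A fully connected design for three blocks: weight 1 between every pair
# of blocks, 0 on the diagonal
block.names <- c("mirna", "mrna", "protein")
design.full <- matrix(1, nrow = 3, ncol = 3,
                      dimnames = list(block.names, block.names))
diag(design.full) <- 0
```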

Multiblock sPLS-DA

# set the number of features to keep in each X dataset (one value per component)
list.keepX <- list(mirna = c(16, 17), mrna = c(18, 5), protein = c(5, 5))

# run the method
result.sparse.diablo.tcga <- block.splsda(X, Y, keepX = list.keepX)

# plot the contributions of each feature to each dimension
plotLoadings(result.sparse.diablo.tcga, ncomp = 1) 
plotIndiv(result.sparse.diablo.tcga) # plot the samples
plotVar(result.sparse.diablo.tcga) # plot the variables

?block.splsda can be run to determine all default arguments of this function:

  • Same defaults as above for block.plsda,
  • Features to retain (keepX): If unspecified, all features of the original dataframes will be used.
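
Since keepX holds one vector per block with one value per component, it can help to check a candidate list before fitting. The sketch below is a hypothetical helper (`check.keepX` is not part of mixOmics) run against simulated stand-in blocks with made-up dimensions. It only checks that each block has ncomp values and that none exceeds the number of features in that block.

```r
set.seed(42)
n <- 10  # number of samples shared by all blocks

# Simulated stand-ins for the three blocks (hypothetical dimensions)
X.sim <- list(
  mirna   = matrix(rnorm(n * 30), nrow = n),
  mrna    = matrix(rnorm(n * 40), nrow = n),
  protein = matrix(rnorm(n * 20), nrow = n)
)

ncomp <- 2
list.keepX <- list(mirna = c(16, 17), mrna = c(18, 5), protein = c(5, 5))

# Hypothetical helper: TRUE per block if keepX has ncomp entries,
# each between 1 and the number of features in that block
check.keepX <- function(X, keepX, ncomp) {
  stopifnot(setequal(names(X), names(keepX)))
  vapply(names(X), function(b) {
    k <- keepX[[b]]
    length(k) == ncomp && all(k >= 1) && all(k <= ncol(X[[b]]))
  }, logical(1))
}

keepX.ok <- check.keepX(X.sim, list.keepX, ncomp)

# A list asking for more protein features than exist fails the check
bad.keepX <- list(mirna = c(16, 17), mrna = c(18, 5), protein = c(5, 25))
bad.ok <- check.keepX(X.sim, bad.keepX, ncomp)
```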

N-Integration Discriminant Analysis with DIABLO

DIABLO is a novel mixOmics framework for the integration of multiple datasets in a supervised analysis. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies. It can also be referred to as Multiblock (s)PLS-DA. Figure 1 depicts where this method fits into an omics analysis pipeline.


FIGURE 1: Omics Study Pipeline.

DIABLO is the supervised approach within the mixOmics N-integrative framework. It allows users to integrate multiple datasets while explaining their relationship with a categorical outcome variable. The quantity and type of data that this method is meant to handle can be seen in Figure 2.


FIGURE 2: N-Integration Data Framework.

The DIABLO Method

In DIABLO, latent components (linear combinations of variables) are constructed such that the sum of covariances between all pairs of datasets is maximised. Each pairwise covariance is weighted by the corresponding entry of the design matrix. The response variable is transformed into a dummy matrix (i.e. 'one-hot encoded') internally within the function. The regression sGCCA framework from the RGCCA package is used to deflate each of the datasets.
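
The quantities described above can be illustrated with a short base R sketch. It evaluates the design-weighted sum of pairwise covariances for arbitrary, non-optimised loading vectors and shows one way to one-hot encode a categorical outcome. It is not the DIABLO optimisation itself and does not use mixOmics or RGCCA. The simulated blocks, the outcome `Y.sim` and all object names are assumptions made for illustration.

```r
set.seed(123)
n <- 30  # number of samples shared by all blocks

# Simulated, standardised stand-ins for three omics blocks (hypothetical data)
blocks <- list(
  mirna   = scale(matrix(rnorm(n * 6), nrow = n)),
  mrna    = scale(matrix(rnorm(n * 8), nrow = n)),
  protein = scale(matrix(rnorm(n * 5), nrow = n))
)

# Fully connected design: every pair of blocks has weight 1
design <- matrix(1, nrow = 3, ncol = 3,
                 dimnames = list(names(blocks), names(blocks)))
diag(design) <- 0

# Arbitrary unit-norm loading vectors, one per block (not optimised)
loadings <- lapply(blocks, function(x) {
  a <- rnorm(ncol(x))
  a / sqrt(sum(a^2))
})

# One latent component per block: a linear combination of its variables
components <- sapply(names(blocks), function(b) {
  as.vector(blocks[[b]] %*% loadings[[b]])
})

# Design-weighted sum of covariances, counting each pair of blocks once
pair.cov <- cov(components)
criterion <- sum((design * pair.cov)[upper.tri(design)])

# Dummy (one-hot) encoding of a categorical outcome: one column per class
Y.sim <- factor(rep(c("Basal", "Her2", "LumA"), length.out = n))
Y.dummy <- sapply(levels(Y.sim), function(l) as.integer(Y.sim == l))
```

The method searches for the loading vectors that make this criterion as large as possible under its constraints, whereas the sketch only scores one random choice.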

When predicting new samples with a DIABLO model, one prediction per dataset is generated and these are combined by majority vote. This can also be a weighted vote, where each dataset's weight is determined by the correlation between its latent components and the outcome.
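
The two voting schemes can be sketched in base R as follows. The per-block predictions and the block weights are hypothetical values chosen for illustration, and `vote` is a helper defined here rather than a mixOmics function. Returning NA on a tie is an assumption of this sketch.

```r
# Hypothetical class predictions for four new samples, one column per block
block.pred <- data.frame(
  mirna   = c("Basal", "Basal", "LumA", "LumA"),
  mrna    = c("Basal", "Her2",  "LumA", "Her2"),
  protein = c("Her2",  "LumA",  "LumA", "Her2"),
  stringsAsFactors = FALSE
)

# Hypothetical block weights, standing in for the correlation between
# each block's latent components and the outcome
block.weight <- c(mirna = 0.6, mrna = 0.9, protein = 0.8)

# Sum the weights behind each predicted class and return the top class;
# a tie returns NA (an assumption of this sketch)
vote <- function(pred, weight = rep(1, length(pred))) {
  score <- tapply(weight, pred, sum)
  winner <- names(score)[score == max(score)]
  if (length(winner) > 1) NA_character_ else winner
}

majority <- apply(block.pred, 1, vote)
weighted <- apply(block.pred, 1, vote, weight = block.weight)
```

For the second sample the three blocks disagree, so the unweighted vote has no winner, while the weighted vote follows the block with the largest weight.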

The balance between discrimination and integration

A compromise needs to be achieved between maximising the correlation between datasets (X1, ..., XQ) and maximising the discriminative ability of the resulting model on the outcome Y. This trade-off is controlled by the values used in the design matrix. A value between 0.5 and 1 will prioritise the correlation between two given datasets, while a value below 0.5 will prioritise the predictive ability of the model (Singh et al., 2019). How this matrix is constructed depends on both prior knowledge (e.g. “I expect that mRNA data and miRNA data will be highly correlated”) and data-driven conclusions (generated by prior exploration of the data using non-N-integrative methods such as (s)PLS). Construction of the design matrix is explained further in the N-Integrative Methods page.
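
As a minimal sketch of a design that favours discrimination, the base R code below builds a symmetric matrix for the three blocks used in the quick start, with a low off-diagonal weight and a zero diagonal. The value 0.1 is an illustrative choice, not a recommendation from this page.

```r
block.names <- c("mirna", "mrna", "protein")

# Low weight between every pair of blocks, so prediction is prioritised
design <- matrix(0.1, nrow = length(block.names), ncol = length(block.names),
                 dimnames = list(block.names, block.names))
diag(design) <- 0  # a block is not linked to itself

# Prior knowledge can be encoded pair by pair, keeping the matrix symmetric,
# e.g. if mRNA and miRNA are expected to be highly correlated
design.prior <- design
design.prior["mrna", "mirna"] <- 0.8
design.prior["mirna", "mrna"] <- 0.8

# The matrix would then be supplied through the design argument, e.g.
# block.splsda(X, Y, keepX = list.keepX, design = design)
```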

Additional Notes

Our manuscript is a collaborative work between the core team (Dr Florian Rohart, Dr Kim-Anh Lê Cao), and key contributors (Dr Amrit Singh, Benoît Gautier) as a result of a long-term collaboration with the University of British Columbia.

We are also investigating an N-integration method based on kernels (see our mixKernel example), which has currently been developed for unsupervised analysis.

Further Reading

The mixOmics DIABLO methodology has been applied in real research contexts. A few examples can be seen below:

References

  1. Singh, A., Shannon, C. P., Gautier, B., Rohart, F., Vacher, M., Tebbutt, S. J., and Lê Cao, K.-A. (2019). DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics, 35(17):3055–3062.