Quick Start
library(mixOmics) # import the mixOmics library
data(breast.TCGA) # extract the TCGA data
# use the mirna, mrna and protein expression levels as predictive datasets
# note that each dataset is measured across the same individuals (samples)
X1 <- breast.TCGA$data.train$mirna
X2 <- breast.TCGA$data.train$mrna
X3 <- breast.TCGA$data.train$protein
X <- list(mirna = X1, mrna = X2, protein = X3)
Y <- breast.TCGA$data.train$subtype # use the subtype as the outcome variable
Multiblock PLS-DA
result.diablo.tcga <- block.plsda(X, Y) # run the method
plotIndiv(result.diablo.tcga) # plot the samples
plotVar(result.diablo.tcga) # plot the variables
?block.plsda can be run to determine all default arguments of this function:
- Number of components (ncomp = 2): The first two PLS components are calculated,
- Design matrix (design = "full"): The strength of all relationships between dataframes is maximised (= 1), i.e. a “fully connected” design,
- PLS mode (mode = regression): A PLS regression mode is performed,
- Scaling of the data (scale = TRUE): Each block is standardised to zero mean and unit variance.
Multiblock sPLS-DA
# set the number of features to use for the X datasets
list.keepX <- list(mirna = c(16, 17), mrna = c(18, 5), protein = c(5, 5))
# run the method
result.sparse.diablo.tcga <- block.splsda(X, Y, keepX = list.keepX)
# plot the contributions of each feature to each dimension
plotLoadings(result.sparse.diablo.tcga, comp = 1)
plotIndiv(result.sparse.diablo.tcga) # plot the samples
plotVar(result.sparse.diablo.tcga) # plot the variables
?block.splsda can be run to determine all default arguments of this function:
- Same defaults as above for block.plsda,
- Features to retain (keepX): If unspecified, all features of the original dataframes will be used.
N-Integration Discriminant Analysis with DIABLO
DIABLO is a novel mixOmics framework for the integration of multiple datasets in a supervised analysis. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies. It can also be referred to as Multiblock (s)PLS-DA. Figure 1 depicts how this method fits into an omics study pipeline.
FIGURE 1: Omics Study Pipeline.
DIABLO is the supervised approach within the mixOmics N-integrative framework. It allows users to integrate multiple datasets while explaining their relationship with a categorical outcome variable. The quantity and type of data that this method is designed to handle are shown in Figure 2.
FIGURE 2: N-Integration Data Framework.
The DIABLO Method
In DIABLO, latent components (linear combinations of variables) are constructed such that the sum of covariances between all pairs of datasets is maximised. Each pairwise covariance is weighted as indicated by the design matrix. The response variable is transformed into a set of dummy variables (i.e. 'one-hot encoded') internally within the function. The regression sGCCA framework from the RGCCA package is used to deflate each of the datasets.
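To make the dummy encoding of the response concrete, here is a minimal base R sketch. The class labels and the object name Y.dummy are illustrative assumptions, and this is not the package's internal code; it only shows the kind of indicator matrix that a categorical outcome is turned into.

```r
# Hypothetical outcome vector; the labels are illustrative only
Y <- factor(c("Basal", "Her2", "LumA", "Basal", "LumA"))

# One indicator column per class: 1 if the sample belongs to that class, else 0
Y.dummy <- sapply(levels(Y), function(cl) as.integer(Y == cl))
rownames(Y.dummy) <- paste0("sample", seq_along(Y))

# Each row contains exactly one 1, marking the class of that sample
print(Y.dummy)
```

Encoding the outcome this way lets a categorical variable be treated as one more numeric block when covariances with the other datasets are computed.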
When predicting new samples with a DIABLO model, one prediction per dataset is generated and these are combined by majority vote. The vote can also be weighted, with each dataset's weight determined by the correlation between its latent components and the outcome.
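The voting step can be sketched in base R as follows. The per-block predictions, the block weights, and the helper name vote are all made up for illustration, and returning NA on a tie is a choice made for this sketch. It is not the package's implementation, only an example of how per-dataset predictions might be combined.

```r
# Hypothetical per-block class predictions for four new samples
# (rows = samples, columns = blocks); values are made up for illustration
pred <- cbind(
  mirna   = c("Basal", "LumA", "Her2", "LumA"),
  mrna    = c("Basal", "LumA", "LumA", "Her2"),
  protein = c("Her2",  "LumA", "LumA", "Basal")
)
classes <- c("Basal", "Her2", "LumA")

# Hypothetical block weights, standing in for how strongly each block's
# latent components correlate with the outcome
w <- c(mirna = 0.5, mrna = 0.8, protein = 0.7)

# Sum the weights of the blocks voting for each class and keep the top class.
# A tie returns NA rather than an arbitrary winner (a choice made for this sketch).
vote <- function(pred.row, weights) {
  score <- sapply(classes, function(cl) sum(weights[pred.row == cl]))
  winners <- names(score)[score == max(score)]
  if (length(winners) == 1) winners else NA_character_
}

# Majority vote: every block counts equally
majority.vote <- apply(pred, 1, vote, weights = c(mirna = 1, mrna = 1, protein = 1))
# Weighted vote: blocks count in proportion to their weight
weighted.vote <- apply(pred, 1, vote, weights = w)

print(majority.vote)
print(weighted.vote)
```

In the fourth sample the three blocks disagree completely, so the unweighted vote is undecided while the weighted vote follows the block with the largest weight.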
The balance between discrimination and integration
A compromise needs to be achieved between maximising the correlation between datasets (X~1~, …, X~Q~) and maximising the discriminative ability of the resulting model on the outcome y. This compromise is controlled by the values used in the design matrix. A value between 0.5 and 1 will prioritise the correlation between two given datasets, while a value below 0.5 will prioritise the predictive ability of the model (Singh et al., 2019). How this matrix is constructed depends on both prior knowledge (e.g. “I expect that mRNA data and miRNA data will be highly correlated”) and data-driven conclusions (generated by prior exploration of the data using non-N-integrative methods such as (s)PLS). Construction of the design matrix is explained further on the N-Integrative Methods page.
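As a rough illustration, a design matrix for the three blocks used in the Quick Start could be built in base R like this. The values 0.1 and 0.8 are arbitrary assumptions chosen to show the two regimes described above, not recommendations.

```r
block.names <- c("mirna", "mrna", "protein")

# Start with a low weight between every pair of blocks, which favours
# discrimination over correlation (0.1 is an arbitrary illustrative value)
design <- matrix(0.1, nrow = 3, ncol = 3,
                 dimnames = list(block.names, block.names))

# Assumed prior knowledge: miRNA and mRNA are expected to be highly correlated,
# so this pair gets a weight in the 0.5 to 1 range (kept symmetric)
design["mirna", "mrna"] <- design["mrna", "mirna"] <- 0.8

# A block is not linked to itself
diag(design) <- 0

print(design)

# The matrix can then be supplied through the design argument, e.g.
# result <- block.splsda(X, Y, keepX = list.keepX, design = design)
```

The model call is left commented out so that the sketch runs without the package or the Quick Start objects loaded.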
Additional Notes
Our manuscript is a collaborative work between the core team (Dr Florian Rohart, Dr Kim-Anh Lê Cao), and key contributors (Dr Amrit Singh, Benoît Gautier) as a result of a long-term collaboration with the University of British Columbia.
We are also investigating an N-integration method based on kernels (see our example with mixKernel), which is currently developed for unsupervised analysis only.
Further Reading
The mixOmics DIABLO methodology has been applied in real research contexts. A few examples are listed below:
- Lee, A.H., Shannon, C.P., Amenyogbe, N. et al. Dynamic molecular changes during the first week of human life follow a robust developmental trajectory. Nat Commun 10, 1092 (2019)
- Gavin PG, Mullaney JA, Loo D, Cao KL, Gottlieb PA, Hill MM, Zipris D, Hamilton-Williams EE. Intestinal Metaproteomics Reveals Host-Microbiota Interactions in Subjects at Risk for Type 1 Diabetes. Diabetes Care. 2018 Oct;41(10):2178-2186