library(mixOmics) # import the mixOmics library
data(breast.TCGA) # extract the TCGA data
# use the mirna, mrna and protein expression levels as predictive datasets
# note that each dataset is measured across the same individuals (samples)
X1 <- breast.TCGA$data.train$mirna
X2 <- breast.TCGA$data.train$mrna
X3 <- breast.TCGA$data.train$protein
X <- list(mirna = X1, mrna = X2, protein = X3)
Y <- breast.TCGA$data.train$subtype # use the subtype as the outcome variable
result.diablo.tcga <- block.plsda(X, Y) # run the method
plotIndiv(result.diablo.tcga) # plot the samples
plotVar(result.diablo.tcga) # plot the variables
?block.plsda can be run to determine all default arguments of this function:
- ncomp = 2: The first two PLS components are calculated,
- design = "full": The strength of all relationships between dataframes is maximised (= 1) - a “fully connected” design,
- mode = regression: A PLS regression mode is performed,
- scale = TRUE: Each block is standardised to zero means and unit variances.
# set the number of features to use for the X datasets
list.keepX = list(mirna = c(16, 17), mrna = c(18,5), protein = c(5, 5))
# run the method
result.sparse.diablo.tcga <- block.splsda(X, Y, keepX = list.keepX)
# plot the contributions of each feature to each dimension
plotLoadings(result.sparse.diablo.tcga, ncomp = 1)
plotIndiv(result.sparse.diablo.tcga) # plot the samples
plotVar(result.sparse.diablo.tcga) # plot the variables
?block.splsda can be run to determine all default arguments of this function:
- the same defaults as block.pls,
- keepX: If unspecified, all features of the original dataframes will be used.

DIABLO is a novel mixOmics
framework
for the integration of multiple data sets in a supervised analysis.
DIABLO stands for Data Integration
Analysis for Biomarker discovery using
Latent variable approaches for 'Omics
studies. It can also be referred to as Multiblock (s)PLS-DA. Figure 1
depicts how this method fits into the pipeline of an Omics data study.
FIGURE 1: Omics Study Pipeline.
DIABLO is the supervised approach within the mixOmics N-integrative
framework. It allows users to integrate multiple datasets while
explaining their relationship with a categorical outcome variable. The
quantity and type of data that this method is meant to handle can be
seen in Figure 2.
FIGURE 2: N-Integration Data Framework.
In DIABLO, latent components (linear combinations of variables) are
constructed such that the sum of covariances between all pairs of
datasets is maximised. Each pairwise covariance is weighted as
indicated by the design matrix. The response variable is transformed
into a dummy variable (i.e. ‘one hot encoded’) internally within the
function. The regression sGCCA framework from the RGCCA
package is utilised to deflate each of the datasets.
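As a rough illustration of the two ideas above, the base R sketch below one-hot encodes a small outcome factor and evaluates the design-weighted sum of pairwise covariances for a set of component scores. All data and names here (y.toy, scores, design.toy) are made up for the example, and no mixOmics function is called. The sketch only computes the quantity being described for fixed scores. It leaves out the outcome block and does not reproduce how the package searches for the components that maximise it.

```r
# Toy outcome with three classes (hypothetical data, not breast.TCGA)
y.toy <- factor(c("Basal", "Her2", "LumA", "Basal", "LumA"))

# One-hot (dummy) encoding: one indicator column per class
y.dummy <- model.matrix(~ 0 + y.toy)
colnames(y.dummy) <- levels(y.toy)

# Made-up first-component scores for three blocks (one value per sample)
scores <- list(
  mirna   = c(-1.2, 0.4, 0.9, -0.8, 0.7),
  mrna    = c(-1.0, 0.1, 1.1, -0.9, 0.7),
  protein = c(-0.7, 0.3, 0.6, -1.1, 0.9)
)

# Design matrix: weight given to each pair of blocks (zero diagonal)
design.toy <- matrix(1, nrow = 3, ncol = 3,
                     dimnames = list(names(scores), names(scores)))
diag(design.toy) <- 0

# Weighted sum of covariances over all distinct pairs of blocks
weighted.cov.sum <- 0
for (i in 1:2) {
  for (j in (i + 1):3) {
    weighted.cov.sum <- weighted.cov.sum +
      design.toy[i, j] * cov(scores[[i]], scores[[j]])
  }
}
```

Lowering an off-diagonal entry of design.toy reduces the contribution of that pair of blocks to the sum, which is the role the design matrix plays in the criterion.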
When it comes to predicting novel samples with a DIABLO model, one prediction per dataset is generated and these are combined in a majority vote. Alternatively, a weighted vote can be used, where the weight of each dataset is determined by the correlation between its latent components and the outcome.
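The two voting schemes can be sketched in base R as follows. The per-dataset predictions (pred) and the weights (w) are invented for illustration, and returning NA for a tied majority vote is a choice made in this sketch. This is not the package's own prediction code.

```r
# Hypothetical per-block class predictions for four new samples
pred <- data.frame(
  mirna   = c("Basal", "LumA", "Her2", "LumA"),
  mrna    = c("Basal", "LumA", "LumA", "Her2"),
  protein = c("Her2",  "LumA", "LumA", "Basal"),
  stringsAsFactors = FALSE
)

# Hypothetical block weights, standing in for the correlation between
# each block's latent components and the outcome
w <- c(mirna = 0.9, mrna = 0.4, protein = 0.4)

# Majority vote: most frequent class per sample, NA when there is a tie
majority.vote <- apply(pred, 1, function(votes) {
  counts <- table(votes)
  winners <- names(counts)[counts == max(counts)]
  if (length(winners) == 1) winners else NA_character_
})

# Weighted vote: sum the block weights behind each class, keep the largest
weighted.vote <- apply(pred, 1, function(votes) {
  totals <- tapply(w[names(votes)], votes, sum)
  names(totals)[which.max(totals)]
})
```

With these made-up numbers the third sample shows how the two schemes can disagree: two of the three blocks predict LumA, but the single heavily weighted block outvotes them in the weighted scheme.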
A compromise needs to be achieved between maximising the correlation
between datasets (X1, …, XQ) and maximising the discriminative
ability of the resulting model on the outcome y. This
translates to the values used in the design matrix. A value
between 0.5 and 1 will prioritise the correlation between two given
datasets, while a value below 0.5 will prioritise the
predictive ability of the model (Singh et al., 2019). How this matrix is
constructed depends on both prior knowledge (e.g. “I expect that mRNA
data and miRNA data will be highly correlated”) and data-driven
conclusions (generated by prior exploration of the data using
non-N-integrative methods such as (s)PLS). Construction of the design
matrix is explained further in the N-Integrative Methods page.
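A minimal base R sketch of building such a design matrix is given below. The values 0.1 and 0.8 are arbitrary choices made for the example, not recommendations, and the block names simply mirror the list X used earlier on this page.

```r
# Names of the blocks, matching the list X used earlier on this page
block.names <- c("mirna", "mrna", "protein")

# Start from a single off-diagonal value; 0.1 is an arbitrary choice that
# leans towards discrimination rather than between-block correlation
design <- matrix(0.1,
                 nrow = length(block.names), ncol = length(block.names),
                 dimnames = list(block.names, block.names))
diag(design) <- 0  # a block is not linked to itself

# Hypothetical prior knowledge: mRNA and miRNA are expected to be highly
# correlated, so that pair is given a value above 0.5
design["mirna", "mrna"] <- design["mrna", "mirna"] <- 0.8
```

A matrix like this could then be supplied through the design argument, for example block.splsda(X, Y, keepX = list.keepX, design = design).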