# N-integration discriminant analysis with mixDIABLO

Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumors can be classified into several subtypes according to their levels of mRNA expression (Sørlie et al., 2001 [1]). Here we consider a subset of the data generated by The Cancer Genome Atlas Network (Network et al., 2012 [2]). The data were normalized and drastically pre-filtered for illustrative purposes, then divided into a training set (a subset of 150 samples with mRNA, miRNA and proteomics data) and a test set (70 samples with only mRNA, miRNA and methylation data, as the proteomics data are missing). The aim of our N-integration analysis is to identify a highly correlated multi-'omics signature discriminating the breast cancer subtypes Basal, Her2 and LumA.

The full data set for this example (which also includes methylation data and four breast cancer subtypes) can be downloaded here. The example below illustrates an analysis on a smaller data set that is stored in the mixOmics package.

## To begin…

library(mixOmics)


The data for mixDIABLO need to be set up as a list of data matrices measured on the same samples. The blocks are 'omics data sets in which each row matches the same biological sample from one data set to another. The outcome Y is set as a factor for the supervised analysis; here Y corresponds to the PAM50 classification.

data('breast.TCGA')
# extract training data
data = list(mrna = breast.TCGA$data.train$mrna,
            mirna = breast.TCGA$data.train$mirna,
            prot = breast.TCGA$data.train$protein)

# check dimension
lapply(data, dim)

## $mrna
## [1] 150 200
##
## $mirna
## [1] 150 184
##
## $prot
## [1] 150 142


# outcome is a factor
Y = breast.TCGA$data.train$subtype
summary(Y)

## Basal  Her2  LumA
##    45    30    75


# Parameter choice

The design matrix determines which blocks should be connected to maximize the correlation or covariance between components. Its values range from 0 (no correlation) to 1 (correlation to maximize), and the matrix is symmetrical. The design can be chosen based on prior knowledge ('I expect mRNA and miRNA to be highly correlated') or in a data-driven manner (e.g. based on a prior analysis, such as a non-sparse analysis with block.pls to examine the correlation between the different blocks via the modelled components). Our experience has shown that a compromise needs to be achieved between maximizing the correlation between blocks and discriminating the outcome, and that the weights in the design matrix can be set to < 1 between blocks (see an example in our vignette and the supplemental material of [4]). Here we illustrate a 'full' design, where all blocks are connected. We set the number of components to K-1, where K is the number of categories in the Y outcome.

ncomp = 2
design = matrix(1, ncol = length(data), nrow = length(data),
                dimnames = list(names(data), names(data)))
diag(design) = 0
design

##       mrna mirna prot
## mrna     0     1    1
## mirna    1     0    1
## prot     1     1    0


# Tuning

This tuning step chooses the keepX parameters of the block.splsda function. We select the optimal number of variables in each data set using the tune() function, over a grid of keepX values. Note that the grid below has been set to favor a small-ish signature while still allowing a sufficient number of variables for downstream validation / interpretation. See ?tune.block.splsda.

test.keepX = list("mrna" = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                  "mirna" = c(5:9, seq(10, 18, 2), seq(20, 30, 5)),
                  "prot" = c(5:9, seq(10, 18, 2), seq(20, 30, 5)))

tune.TCGA = tune.block.splsda(X = data, Y = Y, ncomp = ncomp,
                              test.keepX = test.keepX, design = design)

The optimal number of variables to select in each data set and on each component is returned by the tuning step above. Alternatively, you can manually input those parameters as indicated below.

list.keepX = list("mrna" = c(7, 12), "mirna" = c(5, 25), "prot" = c(5, 16)) # from tuning step


# Discriminant multiblock model (block.splsda)

The final mixDIABLO model is run as:

sgccda.res = block.splsda(X = data, Y = Y, ncomp = ncomp,
                          keepX = list.keepX, design = design)
#sgccda.res   # list the different functions of interest related to that object

A warning message informs us that the outcome Y has been included automatically in the design, so that the covariance between each block's component and the outcome is maximized, as shown in the final design output:

sgccda.res$design

##       mrna mirna prot Y
## mrna     0     1    1 1
## mirna    1     0    1 1
## prot     1     1    0 1
## Y        1     1    1 0


The selected variables can be extracted with the function selectVar(), for example in the mRNA block:

# mrna variables selected on component 1
selectVar(sgccda.res, block = 'mrna', comp = 1)$mrna$name

## [1] "ZNF552"  "KDM4B"   "FUT8"    "CCNA2"   "LRIG1"   "C4orf34" "PREX1"


# Sample plots

plotDiablo is a diagnostic plot to check whether the correlation between components from each data set has been maximized as specified in the design matrix. We specify which dimension is assessed with the ncomp argument.

plotDiablo(sgccda.res, ncomp = 1)
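The same diagnostic can be repeated for the second dimension; a minimal sketch, assuming the sgccda.res model fitted above:

```r
# assess the correlation between the second components of each block
plotDiablo(sgccda.res, ncomp = 2)
```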


The first components from each data set are highly correlated with each other. The colors and ellipses represent the sample subtypes and indicate the discriminating power of each component to separate the different tumor subtypes.

The sample plot obtained with the plotIndiv function projects each sample into the space spanned by the components of each block. The optional argument blocks can restrict the output to a specific data set. Ellipse plots are also available (argument ellipse = TRUE):

plotIndiv(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')
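For instance, the optional arguments mentioned above can be combined to display a single block with confidence ellipses; a sketch, assuming the fitted sgccda.res object:

```r
# sample plot for the mRNA block only, with confidence ellipses per subtype
plotIndiv(sgccda.res, blocks = 'mrna', ind.names = FALSE,
          legend = TRUE, ellipse = TRUE, title = 'DIABLO, mRNA block')
```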


In the arrow plot below, the start of each arrow indicates the centroid between all data sets for a given sample, and the tips of the arrows indicate the location of that same sample in each block. Such a plot highlights the agreement between all data sets at the sample level, when modeled with DIABLO.

plotArrow(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')


# Variable plots

Several graphical outputs are available to visualize and mine the associations between the selected variables.

The correlation circle plot highlights the contribution of each selected variable to each component, see [3]. plotVar displays the variables from all blocks selected on components 1 and 2 (see here for more examples). Clusters of points indicate a strong correlation between variables. For better visibility we choose to hide the variable names:

plotVar(sgccda.res, var.names = FALSE, style = 'graphics', legend = TRUE,
        pch = c(16, 17, 15), cex = c(2, 2, 2), col = c('blue', 'red2', 'darkgreen'))


The circos plot represents the correlations between variables of different types, displayed on the side quadrants. Several display options are possible, such as showing connections within and between blocks, or the expression levels of each variable in each class (argument line = TRUE). The circos plot is built on a similarity matrix, extended to the case of multiple data sets from [3].

circosPlot(sgccda.res, cutoff = 0.7, line = FALSE)
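The line = TRUE option mentioned above adds the average expression level of each selected variable per class on the outer quadrants; a sketch, using the same correlation cutoff:

```r
# circos plot with per-class expression lines on the outer quadrants
circosPlot(sgccda.res, cutoff = 0.7, line = TRUE)
```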


Another visualization of the correlations between the different types of variables is the relevance network, which is also built on the similarity matrix. Each color represents a type of variable.

network(sgccda.res, blocks = c(1, 2, 3), cex.node.name = 0.6,
        color.node = c('lightblue', 'red2', 'lightgreen'))
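The network can also be thresholded with the cutoff argument, and the underlying graph exported for software such as Cytoscape. A sketch, assuming the igraph package is installed and that the returned object exposes the graph as gR; the 0.4 cutoff and output file name are illustrative choices:

```r
# keep only edges with a similarity above 0.4 and save the resulting graph
my.network = network(sgccda.res, blocks = c(1, 2, 3), cutoff = 0.4,
                     color.node = c('lightblue', 'red2', 'lightgreen'))

library(igraph)
write.graph(my.network$gR, file = 'diablo-network.gml', format = 'gml')
```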


plotLoadings() visualizes the loading weights of each selected variable on each component and in each data set. The color indicates the class in which the variable has the maximum level of expression (contrib = 'max'), calculated on average (method = 'mean') or using the median (method = 'median').

plotLoadings(sgccda.res, comp = 2, contrib = 'max', method = 'median')


The cimDiablo() function produces a clustered image map specifically implemented to represent the multi-'omics molecular signature expression for each sample.

cimDiablo(sgccda.res)


# Performance of the model

We assess the performance of the model with 10 x 10-fold cross-validation using the function perf(). The method runs a block.splsda model on the pre-specified arguments input from our sgccda.res object, but on cross-validated samples. We then assess the accuracy of the prediction on the left-out samples.

In addition to the usual (balanced) classification error rates, predicted scores and stability of the selected features, perf() for DIABLO outputs the performance based on Majority Vote (each data set votes for a prediction for a particular test sample) or the Average Prediction (the prediction scores are continuous values averaged across all data sets).

perf.diablo = perf(sgccda.res, validation = 'Mfold', M = 10,
                   dist = 'mahalanobis.dist', nrepeat = 10)
#perf.diablo  # lists the different outputs
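The error rates across components can also be displayed graphically with the plot method for perf objects, which summarizes the performance averaged over the repeats:

```r
# overall and balanced error rates per component, averaged across repeats
plot(perf.diablo)
```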

# Performance with Majority vote
kable(perf.diablo$MajorityVote.error.rate)

                 comp 1    comp 2
Basal         0.0311111 0.0466667
Her2          1.0000000 0.6900000
LumA          0.0026667 0.0026667
Overall.ER    0.2106667 0.1533333
Overall.BER   0.3445926 0.2464444

# Performance with Weighted prediction
kable(perf.diablo$WeightedPredict.error.rate)

                 comp 1    comp 2
Basal         0.0422222 0.0755556
Her2          1.0000000 0.6800000
LumA          0.0160000 0.0226667
Overall.ER    0.2206667 0.1700000
Overall.BER   0.3527407 0.2594074

# Prediction on an external test set

The predict() function predicts the class of samples from a test set. In our specific case, one data set is missing in the test set.

# prepare test set data: here one block (proteins) is missing
data.test.TCGA = list(mrna = breast.TCGA$data.test$mrna,
                      mirna = breast.TCGA$data.test$mirna)

predict.diablo = predict(sgccda.res, newdata = data.test.TCGA, dist = 'mahalanobis.dist')
# the warning message will inform us that one block is missing
#predict.diablo # list the different outputs

# the confusion table compares the true subtypes with the predicted subtypes for the 2-component model
table(predict.diablo$MajorityVote$mahalanobis.dist[,2], breast.TCGA$data.test$subtype)

##
##         Basal Her2 LumA
##   Basal    17    2    0
##   Her2      1    6    1
##   LumA      0    0   28
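As a sanity check, the balanced error rate on the test set can be recomputed directly from the confusion table above (rows are predicted classes, columns the true subtypes); a base-R sketch:

```r
# confusion matrix copied from the output above (predicted x truth)
conf = matrix(c(17, 2,  0,
                 1, 6,  1,
                 0, 0, 28),
              nrow = 3, byrow = TRUE,
              dimnames = list(predicted = c('Basal', 'Her2', 'LumA'),
                              truth     = c('Basal', 'Her2', 'LumA')))

# per-class error: proportion of each true class that was misclassified
per.class.error = 1 - diag(conf) / colSums(conf)
round(mean(per.class.error), 3)  # balanced error rate, ~0.113
```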


# More details?

Have a look at our references [4] and [5], and the supplemental material of [5] in particular!