Example of mixDIABLO analysis

N-Integration discriminant analysis with mixDiablo

Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumors can be classified into several sub types, according to levels of mRNA expression (Sørlie et al., 2001 [1]). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Network et al., 2012 [2]). Data were normalized and drastically pre-filtered for illustrative purposes. The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set includes 70 samples, but only from the mRNA, miRNA and methylation data (proteomics missing). The aim of our N-integration analysis is to identify a highly correlated multi-‘omics signature discriminating the breast cancer subtypes subgroups Basal, Her2 and LumA.

The full data set for this example (also including methylation data and 4 breast cancer subtypes) can be downloaded here. In the example below illustrate an analysis on a smaller data set that is stored in the mixOmics package.

To begin…

Load the latest version of mixOmics (chek the version!).

library(mixOmics)

The data for mixDIABLO need to be set up as a list of data matrices matching the same samples. The blocks: are ‘omics data sets, where each row matches to the same biological sample from one data set to another. The outcomeY is set as a factor for the supervised analysis. Here Y corresponds to the PAM50 classification.

data('breast.TCGA')
# extract training data
data = list(mrna = breast.TCGA$data.train$mrna, 
            mirna = breast.TCGA$data.train$mirna, 
            prot = breast.TCGA$data.train$protein)

# check dimension
lapply(data, dim)
## $mrna
## [1] 150 200
## 
## $mirna
## [1] 150 184
## 
## $prot
## [1] 150 142
# outcome is a factor
Y = breast.TCGA$data.train$subtype
summary(Y)
## Basal  Her2  LumA 
##    45    30    75

Parameter choice

The matrix design determines which blocks should be connected to maximize the correlation or covariance between components. the values may range between 0 (no correlation) to 1 (correlation to maximize) and is a symmetrical matrix. The design can be chosen based on prior knowledge (‘I expect mRNA and miRNA to be highly correlated’) or data-driven (e.g. based on a prior analysis, such as a non sparse analysis with \texttt{block.pls} to examine the correlation between the different blocks via the modelled components).
Our experience has shown that a compromise between maximising the correlation between blocks, and discriminating the outcome needed to be achieved, and that the weight in the design matrix could be set to < 1 between blocks (see an example in our vignette and information in supplemental material in [4]).

Here we illustrate a ‘full’ design, where all blocks are connected. We set the number of components to K-1, where K is the number of categories in the Y outcome.

ncomp = 2
design = matrix(1, ncol = length(data), nrow = length(data), dimnames = list(names(data), names(data)))
diag(design) = 0
design 
##       mrna mirna prot
## mrna     0     1    1
## mirna    1     0    1
## prot     1     1    0

Tuning

This tuning function should be used to tune the keepX parameters in the block.splsda function.

We choose the optimal number of variables to select in each data set using the tune() function, for a grid of keepX values. Note that the function has been set to favor the small-ish signature while allowing to obtain a sufficient number of variables for downstream validation / interpretation. See ?tune.block.splsda.

test.keepX = list ("mrna" = c(5:9, seq(10, 18, 2), seq(20,30,5)),
                   "mirna" = c(5:9, seq(10, 18, 2), seq(20,30,5)),
                   "prot" = c(5:9, seq(10, 18, 2), seq(20,30,5)))
tune.TCGA = tune.block.splsda(X = data, Y = Y, ncomp = ncomp, 
                              test.keepX = test.keepX, design = design)

The optimal number of variables to select in each data set and each component is indicated in list.keepX from the tuning step above. Alternatively, you can manually input those parameters as indicated below.

list.keepX = list("mrna" = c(7,12), "mirna" = c(5,25), "prot" = c(5,16)) # from tuning step

Discriminant multiblock model (block.splsda)

The final mixDIABLO model is run as:

sgccda.res = block.splsda(X = data, Y = Y, ncomp = ncomp, keepX = list.keepX, design = design)
#sgccda.res   # list the different functions of interest related to that object

The warning message informs that the outcome Y has been included automatically in the design, so that the covariance between each block’s component and the outcome is maximized, as shown in the final design output:

sgccda.res$design
##       mrna mirna prot Y
## mrna     0     1    1 1
## mirna    1     0    1 1
## prot     1     1    0 1
## Y        1     1    1 0

The selected variables can be extracted with the function selectVar(), for example in the mRNA block:

# mrna variables selected on component 1
selectVar(sgccda.res, block = 'mrna', comp = 1)$mrna$name 
## [1] "ZNF552"  "KDM4B"   "FUT8"    "CCNA2"   "LRIG1"   "C4orf34" "PREX1"

Sample plots

plotDIABLO is a diagnostic plot to check whether the correlation between components from each data set has been maximized as specified in the design matrix. We specify which dimension to be assessed with the ncomp argument.

plotDiablo(sgccda.res, ncomp = 1)

plot of chunk unnamed-chunk-12

The first components from each data set are highly correlated to each others. The colors and ellipses related to the sample sub types and indicate the discriminating power of each component to separate the different tumor sub types.

The sample plot with the plotIndiv function projects each sample into the space spanned by the components of each block. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE):

plotIndiv(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

plot of chunk unnamed-chunk-13

In the arrow plot below, the start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrows the location of that same sample in each block. Such graphic highlight the agreement between all data sets at the sample level, when modeled with DIABLO.

plotArrow(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

plot of chunk unnamed-chunk-14

Variable plots

Several graphical outputs are available to visualize and mine the associations between the selected variables.

The correlation circle plot highlights the contribution of each selected variable to each component, see [3]. plotVar displays the variables from all blocks, selected on component 1 and 2 (see here for more examples). Clusters of points indicate a strong correlation between variables. For better visibility we choose to hide the variable names:

plotVar(sgccda.res, var.names = FALSE, style = 'graphics', legend = TRUE, 
        pch = c(16, 17, 15), cex = c(2,2,2), col = c('blue', 'red2', 'darkgreen'))

plot of chunk unnamed-chunk-15

The circos plot represents the correlations between variables of different types, represented on the side quadrants. Several display options are possible, to show within and between connexions between blocks, expression levels of each variable according to each class (argument line = TRUE). The circos plot is built based on a similarity matrix, extended to the case of multiple data sets from [3].

circosPlot(sgccda.res, cutoff = 0.7, line = FALSE)

plot of chunk unnamed-chunk-16

Another visualization of the correlation between the different types of variables is the relevance network, that is also built on the similarity matrix. Each color represents a type of variable.

network(sgccda.res, blocks = c(1,2,3), cex.node.name = 0.6, 
        color.node = c('lightblue', 'red2', 'lightgreen'))

plot of chunk unnamed-chunk-17

plotLoadings()* visualizes the loading weights of each selected variables on each component and each data set. The color indicates the class in which the variable has the maximum level of expression (contrib = ‘max’), on average (method = ‘mean’) or using the median (method = ‘median’).

plotLoadings(sgccda.res, comp = 2, contrib = 'max', method = 'median')

plot of chunk unnamed-chunk-18

The cimDIABLO() function is a clustered image map specifically implemented to represent the multi-omics molecular signature expression for each sample.

cimDiablo(sgccda.res)

plot of chunk unnamed-chunk-19

Performance of the model

We assess the performance of the model using 10 x 10-fold cross-validation using the function perf(). The method runs a block.splsda model on the pre-specified arguments input from our sgccda.res object but on cross-validated samples. We then assess the accuracy of the prediction on the left out samples.

In addition to the usual (balanced) classification error rates, predicted scores and stability of the selected features, perf() for DIABLO outputs the performance based on Majority Vote (each data set votes for a prediction for a particular test sample) or the Average Prediction (the prediction scores are continuous values averaged across all data sets).

perf.diablo = perf(sgccda.res, validation = 'Mfold', M = 10, 
                   dist = 'mahalanobis.dist', nrepeat = 10)
#perf.diablo  # lists the different outputs

# Performance with Majority vote
kable(perf.diablo$MajorityVote.error.rate)
comp 1 comp 2
Basal 0.0311111 0.0466667
Her2 1.0000000 0.6900000
LumA 0.0026667 0.0026667
Overall.ER 0.2106667 0.1533333
Overall.BER 0.3445926 0.2464444
# Performance with Weighted prediction
kable(perf.diablo$WeightedPredict.error.rate)
comp 1 comp 2
Basal 0.0422222 0.0755556
Her2 1.0000000 0.6800000
LumA 0.0160000 0.0226667
Overall.ER 0.2206667 0.1700000
Overall.BER 0.3527407 0.2594074

Prediction on an external test set

The predict() function predicts the class of samples from a test set. In our specific case on data set is missing in the test set.

# prepare test set data: here one block (proteins) is missing
data.test.TCGA = list(mrna = breast.TCGA$data.test$mrna, 
                      mirna = breast.TCGA$data.test$mirna)

predict.diablo = predict(sgccda.res, newdata = data.test.TCGA, dist = 'mahalanobis.dist')
# the warning message will inform us that one block is missing
#predict.diablo # list the different outputs

# the confusion table compares the real subtypes with the predicted subtypes for a 2 component model
table(predict.diablo$MajorityVote$mahalanobis.dist[,2], breast.TCGA$data.test$subtype)
##        
##         Basal Her2 LumA
##   Basal    17    2    0
##   Her2      1    6    1
##   LumA      0    0   28

More details?

Have a look at our reference [4] and [5] and the supplemental of [5] in particular!

Reference

  1. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Van De Rijn M, Jeffrey SS, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19):10869–10874, 2001.
  2. Cancer Genome Atlas Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.
  3. Gonzalez, I, Le Cao, K.A., Davis, M.J. and Dejean, S., 2012. Visualising associations between paired ‘omics’ data sets. BioData mining, 5(1)
  4. Singh, A, Gautier, B, Shannon, CP, Vacher, M, Rohart, F, Tebutt, SJ. and Le Cao, KA, 2016. DIABLO-an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv, p.067611.
  5. Rohart F, Gautier, B, Singh, A and Lê Cao, KA. mixOmics: an R package for ‘omics feature selection and multiple data integration. On bioRxiv. Pdf, Sweave and R scripts available here.