Case study: TCGA

N-Integration discriminant analysis with DIABLO

Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumors can be classified into several sub types, according to levels of mRNA expression (Sørlie et al., 2001 [1]). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Network et al., 2012 [2]). Data were normalized and drastically pre-filtered for illustrative purposes. The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set includes 70 samples, but only from the mRNA, miRNA and methylation data (proteomics missing). The aim of our N-integration analysis is to identify a highly correlated multi-‘omics signature discriminating the breast cancer subtypes subgroups Basal, Her2 and LumA.

The full data set for this example (also including methylation data and 4 breast cancer subtypes) can be downloaded here. In the example below illustrate an analysis on a smaller data set that is stored in the mixOmics package.

Load the latest version of mixOmics (check the version ad the sessionInfo() output at the end of this page).



The R script is available at this link


Data preparation

The data for mixDIABLO need to be set up as a list of data matrices matching the same samples. The blocks: are 'omics data sets, where each row matches to the same biological sample from one data set to another. The outcome Y is set as a factor for the supervised analysis. Here Y corresponds to the PAM50 classification.

The data for DIABLO need to be set up as a list of data matrices matching the same samples in rows.

# extract training data
data = list(mRNA = breast.TCGA$data.train$mrna, 
            miRNA = breast.TCGA$data.train$mirna, 
            proteomics = breast.TCGA$data.train$protein)

# check dimension
lapply(data, dim)
## $mRNA
## [1] 150 200
## $miRNA
## [1] 150 184
## $proteomics
## [1] 150 142
# outcome
Y = breast.TCGA$data.train$subtype
## Basal  Her2  LumA 
##    45    30    75

Parameter choice


The matrix design determines which blocks should be connected to maximize the correlation or covariance between components. the values may range between 0 (no correlation) to 1 (correlation to maximize) and is a symmetrical matrix. The design can be chosen based on prior knowledge (‘I expect mRNA and miRNA to be highly correlated’) or data-driven (e.g. based on a prior analysis, such as a non sparse analysis with block.pls to examine the correlation between the different blocks via the modelled components).
Our experience has shown that a compromise between maximising the correlation between blocks, and discriminating the outcome needed to be achieved, and that the weight in the design matrix could be set to < 1 between blocks (see an example in our vignette and information in supplemental material in [4]).

Here we choose a design where all the blocks are connected with a link of 0.1 (see Supplemental Information S3). We set the number of components to K-1, where K is the number of categories in the Y outcome.

design = matrix(0.1, ncol = length(data), nrow = length(data), 
                dimnames = list(names(data), names(data)))
diag(design) = 0

##            mRNA miRNA proteomics
## mRNA        0.0   0.1        0.1
## miRNA       0.1   0.0        0.1
## proteomics  0.1   0.1        0.0

Tuning the number of components

First, we fit a DIABLO model without variable selection to assess the global performance and choose the number of components for the final DIABLO model. The function perf is run with 10-fold cross validation repeated 10 times.

sgccda.res = block.splsda(X = data, Y = Y, ncomp = 5, 
                           design = design)

set.seed(123) # for reproducibility, only when the `cpus' argument is not used
# this code takes a couple of min to run
perf.diablo = perf(sgccda.res, validation = 'Mfold', folds = 10, nrepeat = 10)

#perf.diablo  # lists the different outputs

plot of chunk unnamed-chunk-4

From the performance plot above we observe that both overall and balanced error rate (BER) decrease from 1 to 2 components. The standard deviation indicates a potential slight gain in adding more components. The centroids.dist distance seem to give the best accuracy (see Supplemental Material in [6]). Considering this distance and the BER, the output $choice.ncomp indicates an optimal number of components for the final DIABLO model.

##             max.dist centroids.dist mahalanobis.dist
## Overall.ER         2              3                3
## Overall.BER        5              2                4
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]

Tuning keepX

This tuning function should be used to tune the keepX parameters in the block.splsda function.

We choose the optimal number of variables to select in each data set using the tune function, for a grid of keepX values. Note that the function has been set to favor the small-ish signature while allowing to obtain a sufficient number of variables for downstream validation / interpretation. See ?tune.block.splsda.

We choose the optimal number of variables to select in each data set using the tune.block.splsda function, to which we provide a grid of keepX values for each type of omics. Note that the function has been set to favour the small-ish signature while allowing to obtain a sufficient number of variables for downstream validation / interpretation.

The function tune is run with 10-fold cross validation, but repeated only once. Note that for a more thorough tuning process, provided sufficient computational time, we could increase the nrepeat argument.

Here we have saved the results into a RData object that is available for download (the tuning takes ~ 1h30).

#set.seed(123) # for reproducibility, only when the `cpus' argument is not used
test.keepX = list (mRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)),
                   miRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)),
                   proteomics = c(5:9, seq(10, 18, 2), seq(20,30,5)))

tune.TCGA = tune.block.splsda(X = data, Y = Y, ncomp = ncomp, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 1,
                              cpus = 2, dist = "centroids.dist")

list.keepX = tune.TCGA$choice.keepX

The number of features to select on each component is returned in tune.TCGA$choice.keepX. Alternatively, you can manually input those parameters as indicated below.

list.keepX = list(mRNA = c(6,14), miRNA = c(5,18), proteomics = c(6,7)) # from tuning step

Final model

The final DIABLO model is run as:

sgccda.res = block.splsda(X = data, Y = Y, ncomp = ncomp, 
                          keepX = list.keepX, design = design)
#sgccda.res   # list the different functions of interest related to that object

The warning message informs that the outcome Y has been included automatically in the design, so that the covariance between each block's component and the outcome is maximised, as shown in the final design output

##            mRNA miRNA proteomics Y
## mRNA        0.0   0.1        0.1 1
## miRNA       0.1   0.0        0.1 1
## proteomics  0.1   0.1        0.0 1
## Y           1.0   1.0        1.0 0

The selected variables can be extracted with the function selectVar, for example in the mRNA block:

# mrna variables selected on component 1
selectVar(sgccda.res, block = 'mRNA', comp = 1)$mRNA$name 
## [1] "ZNF552" "KDM4B"  "PREX1"  "CCNA2"  "LRIG1"  "FUT8"

Sample plots

plotDIABLO is a diagnostic plot to check whether the correlation between components from each data set has been maximized as specified in the design matrix. We specify which dimension to be assessed with the ncomp argument.

plotDiablo(sgccda.res, ncomp = 1)

plot of chunk unnamed-chunk-13

The first components from each data set are highly correlated to each other. The colors and ellipses related to the sample subtypes and indicate the discriminative power of each component to separate the different tumour subtypes.

The sample plot with the plotIndiv function projects each sample into the space spanned by the components of each block. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE):

plotIndiv(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

plot of chunk unnamed-chunk-14

In the arrow plot below, the start of the arrow indicates the centroid between all data sets for a given sample and the tips of the arrows the location of that sample in each block. Such graphic highlight the agreement between all data sets at the sample level, when modelled with DIABLO.

plotArrow(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

plot of chunk unnamed-chunk-15

Variable plots

Several graphical outputs are available to visualize and mine the associations between the selected variables.

The correlation circle plot highlights the contribution of each selected variable to each component, see [3]. plotVar displays the variables from all blocks, selected on component 1 and 2 (see here for more examples). Clusters of points indicate a strong correlation between variables. For better visibility we choose to hide the variable names:

plotVar(sgccda.res, var.names = FALSE, style = 'graphics', legend = TRUE, 
        pch = c(16, 17, 15), cex = c(2,2,2), col = c('darkorchid', 'brown1', 'lightgreen'))

plot of chunk unnamed-chunk-16

The circos plot represents the correlations between variables of different types, represented on the side quadrants. Several display options are possible, to show within and between connexions between blocks, expression levels of each variable according to each class (argument line = TRUE). The circos plot is built based on a similarity matrix, extended to the case of multiple data sets from [3].

circosPlot(sgccda.res, cutoff = 0.7, line = TRUE, 
           color.blocks= c('darkorchid', 'brown1', 'lightgreen'),
           color.cor = c("chocolate3","grey20"), size.labels = 1.5)

plot of chunk unnamed-chunk-17

Another visualisation of the correlation between the different types of variables is the relevance network, which is also built on the similarity matrix. Each color represents a type of variable. A threshold can also be set using the argument cutoff.

Note that sometimes the output may not show with Rstudio because of margin issues. The plot can be saved as an image using the argument save and An interactive argument is also available for the cutoff argument, see details in ?network.

network(sgccda.res, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)

plot of chunk unnamed-chunk-18

The network can be saved in a .gml format to be input into the software Cytoscape, using the R package igraph

# not run
library(igraph) = network(sgccda.res, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
write.graph($gR, file = "myNetwork.gml", format = "gml")

plotLoadings() visualizes the loading weights of each selected variables on each component and each data set. The color indicates the class in which the variable has the maximum level of expression (contrib = 'max'), on average (method = 'mean') or using the median (method = 'median').

plotLoadings(sgccda.res, comp = 2, contrib = 'max', method = 'median')

plot of chunk unnamed-chunk-20

The cimDIABLO() function is a clustered image map specifically implemented to represent the multi-omics molecular signature expression for each sample.


plot of chunk unnamed-chunk-21

Performance of the model

We assess the performance of the model using 10-fold cross-validation repeated 10 times, using the function perf. The method runs a block.splsda model on the pre-specified arguments input from our sgccda.res object but on cross-validated samples. We then assess the accuracy of the prediction on the left out samples.

In addition to the usual (balanced) classification error rates, predicted dummy variables and variates, as well as stability of the selected features, the perf function for DIABLO outputs the performance based on Majority Vote (each data set votes for a class for a particular test sample) or a weighted vote, where the weight is defined according to the correlation between the latent component associated to a particular data set and the outcome.

Since the tune function was used with the centroid.dist argument, we examine the outputs of the perf function for that same distance.
The following code may take a few minutes to run.

set.seed(123)# for reproducibility, only when the `cpus' argument is not used
perf.diablo = perf(sgccda.res, validation = 'Mfold', M = 10, nrepeat = 10, 
                   dist = 'centroids.dist')
#perf.diablo  # lists the different outputs

# Performance with Majority vote
## $centroids.dist
##                 comp 1     comp 2
## Basal       0.04888889 0.06444444
## Her2        0.25000000 0.16000000
## LumA        0.07466667 0.02133333
## Overall.ER  0.10200000 0.06200000
## Overall.BER 0.12451852 0.08192593
# Performance with Weighted prediction
## $centroids.dist
##                 comp 1     comp 2
## Basal       0.03777778 0.06444444
## Her2        0.18666667 0.15333333
## LumA        0.07466667 0.02133333
## Overall.ER  0.08600000 0.06066667
## Overall.BER 0.09970370 0.07970370

An AUC plot per block can also be obtained using the function auroc. Refer to [5] for the interpretation of such output as the ROC and AUC criteria are not particularly insightful in relation to the performance evaluation of our methods, but can complement the statistical analysis.

auc.splsda = auroc(sgccda.res, roc.block = "miRNA", roc.comp = 2)

plot of chunk unnamed-chunk-23

Prediction on an external test set

The predict function predicts the class of samples from a test set. In our specific case, one data set is missing in the test set but the method can still be applied. Make sure the name of the blocks correspond exactly.

# prepare test set data: here one block (proteins) is missing
data.test.TCGA = list(mRNA = breast.TCGA$data.test$mrna, 
                      miRNA = breast.TCGA$data.test$mirna)

predict.diablo = predict(sgccda.res, newdata = data.test.TCGA)
# the warning message will inform us that one block is missing
#predict.diablo # list the different outputs

The confusion table compares the real subtypes with the predicted subtypes for a 2 component model, for the distance of interest:

confusion.mat = get.confusion_matrix(truth = breast.TCGA$data.test$subtype, 
                     predicted = predict.diablo$WeightedVote$centroids.dist[,2])
## Basal                 20                 1                 0
## Her2                   0                13                 1
## LumA                   0                 3                32
## [1] 0.06825397

Session information of this code

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
## [1] mixOmics_6.3.0     ggplot2_2.2.1.9000 lattice_0.20-35   
## [4] MASS_7.3-47        knitr_1.17         RWordPress_0.2-3  
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.0   purrr_0.2.3        reshape2_1.4.2    
##  [4] colorspace_1.3-2   htmltools_0.3.6    yaml_2.1.14       
##  [7] XML_3.98-1.9       rlang_0.1.2        glue_1.1.1        
## [10] RColorBrewer_1.1-2 bindrcpp_0.2       matrixStats_0.52.2
## [13] plyr_1.8.4         bindr_0.1          stringr_1.2.0     
## [16] munsell_0.4.3      gtable_0.2.0       htmlwidgets_0.9   
## [19] evaluate_0.10.1    labeling_0.3       httpuv_1.3.5      
## [22] parallel_3.3.2     markdown_0.8       XMLRPC_0.3-0      
## [25] highr_0.6          rARPACK_0.11-0     Rcpp_0.12.13      
## [28] xtable_1.8-2       corpcor_1.6.9      scales_0.5.0.9000 
## [31] backports_1.1.0    jsonlite_1.5       mime_0.5          
## [34] RSpectra_0.12-0    gridExtra_2.3      ellipse_0.3-8     
## [37] digest_0.6.12      stringi_1.1.5      dplyr_0.7.4       
## [40] shiny_1.0.5        grid_3.3.2         rprojroot_1.2     
## [43] tools_3.3.2        bitops_1.0-6       magrittr_1.5      
## [46] rgl_0.98.1         lazyeval_0.2.0     RCurl_1.95-4.8    
## [49] tibble_1.3.4       tidyr_0.7.1        pkgconfig_2.0.1   
## [52] Matrix_1.2-11      assertthat_0.2.0   rmarkdown_1.6     
## [55] R6_2.2.2           igraph_1.1.2

More details?

Have a look at our reference [4] and [5] and the supplemental of [5] in particular!


  1. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Van De Rijn M, Jeffrey SS, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19):10869–10874, 2001.

  2. Cancer Genome Atlas Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.

  3. Gonzalez, I, Le Cao, K.A., Davis, M.J. and Dejean, S., 2012. Visualising associations between paired ‘omics’ data sets. BioData mining, 5(1)

  4. Singh, A, Gautier, B, Shannon, CP, Vacher, M, Rohart, F, Tebutt, SJ. and Le Cao, KA, 2016. DIABLO-an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv, p.067611.

  5. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration.

  6. Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells, PeerJ 4:e1845