N-Integration discriminant analysis with DIABLO

Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumors can be classified into several sub types, according to levels of mRNA expression (Sørlie et al., 2001 [1]). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Network et al., 2012 [2]). Data were normalized and drastically pre-filtered for illustrative purposes. The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set includes 70 samples, but only from the mRNA, miRNA and methylation data (proteomics missing). The aim of our N-integration analysis is to identify a highly correlated multi-‘omics signature discriminating the breast cancer subtypes subgroups Basal, Her2 and LumA.

The full data set for this example (also including methylation data and 4 breast cancer subtypes) can be downloaded here. In the example below illustrate an analysis on a smaller data set that is stored in the mixOmics package.

Load the latest version of mixOmics (check the version!).

library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.2.0
## 
## Visit http://www.mixOmics.org for more details about our methods.
## Any bug reports or comments? Notify us at mixomics at math.univ-toulouse.fr or https://bitbucket.org/klecao/package-mixomics/issues
## 
## Thank you for using mixOmics!

The R script is provided with this tutorial, along with the RData to load some already-run results to save some time.

Data

Data preparation

The data for mixDIABLO need to be set up as a list of data matrices matching the same samples. The blocks: are ’omics data sets, where each row matches to the same biological sample from one data set to another. The outcome Y is set as a factor for the supervised analysis. Here Y corresponds to the PAM50 classification.

The data for DIABLO need to be set up as a list of data matrices matching the same samples in rows.

data('breast.TCGA')
# extract training data
data = list(mRNA = breast.TCGA$data.train$mrna, 
            miRNA = breast.TCGA$data.train$mirna, 
            proteomics = breast.TCGA$data.train$protein)

# check dimension
lapply(data, dim)
## $mRNA
## [1] 150 200
## 
## $miRNA
## [1] 150 184
## 
## $proteomics
## [1] 150 142
# outcome
Y = breast.TCGA$data.train$subtype
summary(Y)
## Basal  Her2  LumA 
##    45    30    75

Parameter choice

Design

The matrix design determines which blocks should be connected to maximize the correlation or covariance between components. The values may range between 0 (no correlation) to 1 (correlation to maximize) and is a symmetrical matrix. The design can be chosen based on prior knowledge (‘I expect mRNA and miRNA to be highly correlated’) or data-driven (e.g. based on a prior analysis, such as a non sparse analysis with block.pls to examine the correlation between the different blocks via the modelled components). Our experience has shown that a compromise between maximising the correlation between blocks, and discriminating the outcome needed to be achieved, and that the weight in the design matrix could be set to < 1 between blocks (see an example in our vignette and information in supplemental material in [4]).

Here we choose a design where all the blocks are connected with a link of 0.1 (see Supplemental Information S3). We set the number of components to K-1, where K is the number of categories in the Y outcome.

design = matrix(0.1, ncol = length(data), nrow = length(data), 
                dimnames = list(names(data), names(data)))
diag(design) = 0

design 
##            mRNA miRNA proteomics
## mRNA        0.0   0.1        0.1
## miRNA       0.1   0.0        0.1
## proteomics  0.1   0.1        0.0

Tuning the number of components and the number of variables to select

Refer to the full example with the tuning step at http://mixomics.org/tcga-example/

Here we load pre-computed results.

The number of features to select on each component is returned from the tuning step, or, alternatively we can manually input those parameters as indicated below.

load('RData/result-TCGA-diablo_design0.1.RData')
list.keepX = tune.TCGA$choice.keepX
ncomp = 3
# manual input
list.keepX = list(mRNA = c(6,14), miRNA = c(5,18), proteomics = c(6,7)) 

Final model

The final DIABLO model is run as:

sgccda.res = block.splsda(X = data, Y = Y, ncomp = ncomp, 
                          keepX = list.keepX, design = design)
#sgccda.res   # list the different functions of interest related to that object

The warning message informs that the outcome Y has been included automatically in the design, so that the covariance between each block’s component and the outcome is maximised, as shown in the final design output

sgccda.res$design
##            mRNA miRNA proteomics Y
## mRNA        0.0   0.1        0.1 1
## miRNA       0.1   0.0        0.1 1
## proteomics  0.1   0.1        0.0 1
## Y           1.0   1.0        1.0 0

The selected variables can be extracted with the function selectVar, for example in the mRNA block:

# mrna variables selected on component 1
selectVar(sgccda.res, block = 'mRNA', comp = 1)$mRNA$name 
## [1] "ZNF552" "KDM4B"  "CCNA2"  "PREX1"  "LRIG1"

Sample plots

plotDIABLO is a diagnostic plot to check whether the correlation between components from each data set has been maximized as specified in the design matrix. We specify which dimension to be assessed with the ncomp argument.

plotDiablo(sgccda.res, ncomp = 1)

The first components from each data set are highly correlated to each other. The colors and ellipses related to the sample subtypes and indicate the discriminative power of each component to separate the different tumour subtypes.

The sample plot with the plotIndiv function projects each sample into the space spanned by the components of each block. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE):

plotIndiv(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

In the arrow plot below, the start of the arrow indicates the centroid between all data sets for a given sample and the tips of the arrows the location of that sample in each block. Such graphic highlight the agreement between all data sets at the sample level, when modelled with DIABLO.

plotArrow(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')

Variable plots

Several graphical outputs are available to visualize and mine the associations between the selected variables.

The correlation circle plot highlights the contribution of each selected variable to each component, see [3]. plotVar displays the variables from all blocks, selected on component 1 and 2 (see here for more examples). Clusters of points indicate a strong correlation between variables. For better visibility we choose to hide the variable names:

plotVar(sgccda.res, var.names = FALSE, style = 'graphics', legend = TRUE, 
        pch = c(16, 17, 15), cex = c(2,2,2), col = c('darkorchid', 'brown1', 'lightgreen'))

The circos plot represents the correlations between variables of different types, represented on the side quadrants. Several display options are possible, to show within and between connexions between blocks, expression levels of each variable according to each class (argument line = TRUE). The circos plot is built based on a similarity matrix, extended to the case of multiple data sets from [3].

circosPlot(sgccda.res, cutoff = 0.7, line = FALSE, 
           color.blocks= c('darkorchid', 'brown1', 'lightgreen'),
           color.cor = c("chocolate3","grey20"), size.labels = 1.5)

Another visualisation of the correlation between the different types of variables is the relevance network, which is also built on the similarity matrix. Each color represents a type of variable. A threshold can also be set using the argument cutoff.

Note that sometimes the output may not show with Rstudio because of margin issues. The plot can be saved as an image using the argument save and name.save. An interactive argument is also available for the cutoff argument, see details in ?network.

# if R studio is giving you trouble, best is to save
network(sgccda.res, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4, 
        save = 'jpeg', name.save = 'mynetwork')

The network can be saved in a .gml format to be input into the software Cytoscape, using the R package igraph

# not run
library(igraph)
my.network = network(sgccda.res, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
write.graph(my.network$gR, file = "myNetwork.gml", format = "gml")

plotLoadings() visualizes the loading weights of each selected variables on each component and each data set. The color indicates the class in which the variable has the maximum level of expression (contrib = ‘max’), on average (method = ‘mean’) or using the median (method = ‘median’).

plotLoadings(sgccda.res, comp = 2, contrib = 'max', method = 'median')

The cimDIABLO() function is a clustered image map specifically implemented to represent the multi-`omics molecular signature expression for each sample.

# cimDiablo(sgccda.res)

Performance of the model

See our website link.

AUC

An AUC plot per block can also be obtained using the function auroc. Refer to [5] for the interpretation of such output as the ROC and AUC criteria are not particularly insightful in relation to the performance evaluation of our methods, but can complement the statistical analysis.

auc.splsda = auroc(sgccda.res, roc.block = "miRNA", roc.comp = 2)

Prediction on an external test set

The predict function predicts the class of samples from a test set. In our specific case, one data set is missing in the test set but the method can still be applied. Make sure the name of the blocks correspond exactly.

# prepare test set data: here one block (proteins) is missing
data.test.TCGA = list(mRNA = breast.TCGA$data.test$mrna, 
                      miRNA = breast.TCGA$data.test$mirna)

predict.diablo = predict(sgccda.res, newdata = data.test.TCGA)
# the warning message will inform us that one block is missing
#predict.diablo # list the different outputs

The confusion table compares the real subtypes with the predicted subtypes for a 2 component model, for the distance of interest:

confusion.mat = get.confusion_matrix(truth = breast.TCGA$data.test$subtype, 
                     predicted = predict.diablo$WeightedVote$centroids.dist[,2])
confusion.mat
##       predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal                 20                 1                 0
## Her2                   0                13                 1
## LumA                   0                 3                32
get.BER(confusion.mat)
## [1] 0.06825397

Session information of this code

sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.5
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] mixOmics_6.2.0  ggplot2_2.2.1   lattice_0.20-34 MASS_7.3-45    
## [5] knitr_1.16     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.11       RColorBrewer_1.1-2 plyr_1.8.4        
##  [4] tools_3.3.2        digest_0.6.12      jsonlite_1.4      
##  [7] evaluate_0.10      tibble_1.3.1       gtable_0.2.0      
## [10] rlang_0.1.1        igraph_1.0.1       shiny_1.0.3       
## [13] DBI_0.5-1          yaml_2.1.14        parallel_3.3.2    
## [16] stringr_1.2.0      dplyr_0.5.0        htmlwidgets_0.8   
## [19] rprojroot_1.1      grid_3.3.2         ellipse_0.3-8     
## [22] R6_2.2.1           rgl_0.97.0         rmarkdown_1.2     
## [25] reshape2_1.4.2     tidyr_0.6.3        corpcor_1.6.8     
## [28] magrittr_1.5       backports_1.0.4    scales_0.4.1      
## [31] htmltools_0.3.6    assertthat_0.2.0   mime_0.5          
## [34] colorspace_1.3-2   xtable_1.8-2       httpuv_1.3.3      
## [37] labeling_0.3       stringi_1.1.5      lazyeval_0.2.0    
## [40] munsell_0.4.3

More details?

Have a look at our reference [4] and [5] and the supplemental of [5] in particular!

References

  1. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Van De Rijn M, Jeffrey SS, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19):10869–10874, 2001.

  2. Cancer Genome Atlas Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.

  3. Gonzalez, I, Le Cao, K.A., Davis, M.J. and Dejean, S., 2012. Visualising associations between paired ‘omics’ data sets. BioData mining, 5(1)

  4. Singh, A, Gautier, B, Shannon, CP, Vacher, M, Rohart, F, Tebutt, SJ. and Le Cao, KA, 2016. DIABLO-an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv, p.067611.

  5. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for ’omics feature selection and multiple data integration.

  6. Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells, PeerJ 4:e1845