Case Study of DIABLO with Breast TCGA Dataset

The aim of this N-integration analysis is to identify a highly correlated multi-omics signature that discriminates the breast cancer subtypes Basal, Her2 and LumA.

The full data set for this example (also including methylation data and 4 breast cancer subtypes) can be downloaded here. The example below illustrates an analysis on a smaller data set that is stored in the mixOmics package.


The R script used for all the analysis in this case study is available here.

To begin

Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.

library(mixOmics) # import the mixOmics library

set.seed(123) # for reproducibility, remove for normal use

The data

Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumours can be classified into several subtypes, according to levels of mRNA expression (Sørlie et al., 2001 [1]). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Network et al., 2012 [2]). Data were normalized and drastically pre-filtered for illustrative purposes. The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set includes 70 samples, but only from the mRNA, miRNA and methylation data (proteomics missing).

The mixOmics TCGA dataset is accessed via breast.TCGA' and contains the following:

  • breast.TCGA$data.train$mirna (continuous matrix): 150 rows and 184 columns. The expression levels of 184 different sections of miRNA.
  • breast.TCGA$data.train$mrna (continuous matrix): 150 rows and 200 columns. The expression levels of 200 different sections of mRNA.
  • breast.TCGA$data.train$protein (continuous matrix): 150 rows and 142 columns. The abundance of 142 different proteins
  • breast.TCGA$data.train$subtype (categorical vector): length of 150. Indicates the breast cancer subtype of each subject. Includes Basal, Her2 and LumA.

To confirm the correct dataframes were extracted, the dimensions are checked. The distribution of class labels is also examined. It can be seen that these class labels are not balanced. Note that the dataframes in the data list should be named.

data(breast.TCGA) # load in the data

# set a list of all the X dataframes
data = list(miRNA = breast.TCGA$data.train$mirna, 
            mRNA = breast.TCGA$data.train$mrna,
            proteomics = breast.TCGA$data.train$protein)

lapply(data, dim) # check their dimensions
## $miRNA
## [1] 150 184
## $mRNA
## [1] 150 200
## $proteomics
## [1] 150 142
Y = breast.TCGA$data.train$subtype # set the response variable as the Y df
## Basal  Her2  LumA 
##    45    30    75

Initial Analysis

Pairwise PLS Comparisons

As mentioned in the methods pages, it is strongly advised that prior to using the DIABLO framework that the data be examinsed in a non-integrative context. Here, the correlation between the top 25 features of each dataframe (in a pairwise fashion) are shown in Figure 1.

list.keepX = c(25, 25) # select arbitrary values of features to keep
list.keepY = c(25, 25)

# generate three pairwise PLS models
pls1 <- spls(data[["miRNA"]], data[["mRNA"]], 
             keepX = list.keepX, keepY = list.keepY) 
pls2 <- spls(data[["miRNA"]], data[["proteomics"]], 
             keepX = list.keepX, keepY = list.keepY)
pls3 <- spls(data[["mRNA"]], data[["proteomics"]], 
             keepX = list.keepX, keepY = list.keepY)

# plot features of first PLS
plotVar(pls1, cutoff = 0.5, title = "(a) miRNA vs mRNA", 
        legend = c("miRNA", "mRNA"), 
        var.names = FALSE, style = 'graphics', 
        pch = c(16, 17), cex = c(2,2), 
        col = c('darkorchid', 'lightgreen'))

# plot features of second PLS
plotVar(pls2, cutoff = 0.5, title = "(b) miRNA vs proteomics", 
        legend = c("miRNA", "proteomics"), 
        var.names = FALSE, style = 'graphics', 
        pch = c(16, 17), cex = c(2,2), 
        col = c('darkorchid', 'lightgreen'))

# plot features of third PLS
plotVar(pls3, cutoff = 0.5, title = "(c) mRNA vs proteomics", 
        legend = c("mRNA", "proteomics"), 
        var.names = FALSE, style = 'graphics', 
        pch = c(16, 17), cex = c(2,2), 
        col = c('darkorchid', 'lightgreen'))

plot of chunk unnamed-chunk-3plot of chunk unnamed-chunk-3plot of chunk unnamed-chunk-3

FIGURE 1: Circle Correlation Plots for pairwise PLS models on the breast TCGA data. Only displays the top 25 features for each dimension, subsetting by those with a correlation above 0.5.

The strong correlation between a large portion of the miRNA and mRNA features along the first component and to a slightly lesser extent on the second component is depicted in (a). Some of the features in (b) have very strong clustering while other do not, suggesting that some features of the mRNA and protein dataframes are highly correlated. The mRNA and protein features in © are the most weakly clustered but are the most strongly correlated along the first component specifically.

Below is are correlations between the first component of each dataset for all three PLS models. It can be seen the values are around ~0.8 – 0.9.

# calculate correlation of miRNA and mRNA
cor(pls1$variates$X, pls1$variates$Y) 
##           comp1
## comp1 0.8841792
# calculate correlation of miRNA and proteins
cor(pls2$variates$X, pls2$variates$Y) 
##           comp1
## comp1 0.8361993
# calculate correlation of mRNA and proteins
cor(pls3$variates$X, pls3$variates$Y) 
##           comp1
## comp1 0.9360264

Initial DIABLO Model


The moderate to high correlation between these features means that an appropriate value for the design matrix would be about ~0.8 – 0.9. However, as discussed in the N-Integration Methods page, values above 0.5 will cause a reduction in predictive ability of the model – and prediction is what is desired in this context. Hence, a value of 0.1 will be used to prioritise the discriminative ability of the model.

# for square matrix filled with 0.1s
design = matrix(0.1, ncol = length(data), nrow = length(data), 
                dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s

##            miRNA mRNA proteomics
## miRNA        0.0  0.1        0.1
## mRNA         0.1  0.0        0.1
## proteomics   0.1  0.1        0.0

With a design in place, the initial DIABLO model can be generated. An arbitrarily high number of components (ncomp = 5) will be used.

# form basic DIABLO model
basic.diablo.model = block.splsda(X = data, Y = Y, ncomp = 5, design = design) 

Tuning the number of components

To choose the number of components for the final DIABLO model, the function perf() is run with 10-fold cross-validation repeated 10 times.

# run component number tuning with repeated CV
perf.diablo = perf(basic.diablo.model, validation = 'Mfold', 
                   folds = 10, nrepeat = 10) 

plot(perf.diablo) # plot output of tuning

plot of chunk unnamed-chunk-8

FIGURE 2: Choosing the number of components in block.plsda using perf() with 10 × 10-fold CV function in the breast.TCGA study. Classification error rates (overall and balanced, see Section 7.3) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA

From the performance plot above (Figure 2) we observe that both overall and balanced error rate (BER) decrease from 1 to 2 components. The standard deviation indicates a potential slight gain in adding more components. The centroids.dist distance seems to give the best accuracy (see Supplemental Material in [6]). Considering this distance and the BER, the output $choice.ncomp indicates the optimal number of components for the final DIABLO model.

# set the optimal ncomp value
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"] 
# show the optimal choice for ncomp for each dist metric
##             max.dist centroids.dist mahalanobis.dist
## Overall.ER         3              2                3
## Overall.BER        3              2                3

Tuning the number of features

This tuning function should be used to tune the keepX parameters in the block.splsda() function.

We choose the optimal number of variables to select in each data set using the tune.block.splsda() function, for a grid of keepX values for each type of omics. Note that the function has been set to favour a relatively small signature while allowing us to obtain a sufficient number of variables for downstream validation and/or interpretation. See ?tune.block.splsda.

The function tune is run with 10-fold cross validation, but repeated only once. Note that for a more thorough tuning process, provided sufficient computational time, we could increase the nrepeat argument. Here we have saved the results into an RData object that is available for download as the tuning can take a very long time, especially on lower end machines.

# set grid of values for each component to test
test.keepX = list (mRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)), 
                   miRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)),
                   proteomics = c(5:9, seq(10, 18, 2), seq(20,30,5)))

# run the feature selection tuning
tune.TCGA = tune.block.splsda(X = data, Y = Y, ncomp = ncomp, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 1,
                              dist = "centroids.dist")

The number of features to select on each component is returned in tune.TCGA$choice.keepX.

list.keepX = tune.TCGA$choice.keepX # set the optimal values of features to retain
## $miRNA
## [1] 10  5
## $mRNA
## [1] 25 16
## $proteomics
## [1] 8 5

Final model

The final DIABLO model is run as:

# set the optimised DIABLO model
final.diablo.model = block.splsda(X = data, Y = Y, ncomp = ncomp, 
                          keepX = list.keepX, design = design)

The warning message informs us that the outcome y has been included automatically in the design, so that the covariance between each block's component and the outcome is maximised, as shown in the final design output.

final.diablo.model$design # design matrix for the final model
##            miRNA mRNA proteomics Y
## miRNA        0.0  0.1        0.1 1
## mRNA         0.1  0.0        0.1 1
## proteomics   0.1  0.1        0.0 1
## Y            1.0  1.0        1.0 0

The selected variables can be extracted with the function selectVar(), for example in the mRNA block, as seen below. Note that the stability of selected variables can be extracted from the output of the perf() function.

# the features selected to form the first component
selectVar(final.diablo.model, block = 'mRNA', comp = 1)$mRNA$name 
##  [1] "ZNF552"  "KDM4B"   "CCNA2"   "LRIG1"   "PREX1"   "FUT8"    "C4orf34" "TTC39A"  "ASPM"    "SLC43A3"
## [11] "MEX3A"   "SEMA3C"  "E2F1"    "STC2"    "FMNL2"   "LMO4"    "MED13L"  "DTWD2"   "CSRP2"   "NTN4"   
## [21] "KIF13B"  "NCAPG2"  "SLC19A2" "EPHB3"   "FAM63A"


Sample plots

plotDIABLO() is a diagnostic plot to check whether the correlation between components from each data set has been maximised as specified in the design matrix. We specify which dimension to be assessed with the ncomp argument.

plotDiablo(final.diablo.model, ncomp = 1)

plot of chunk unnamed-chunk-17

FIGURE 3: Diagnostic plot from multiblock sPLS-DA applied on the breast.TCGA study. Samples are represented based on the specified component (here ncomp = 1) for each data set (mRNA, miRNA and protein). Samples are coloured by breast cancer subtype and 95% confidence ellipse plots are represented.

As can be seen in Figure 3, the first components from each data set are highly correlated to each other (indicated by the large numbers in the bottom left). The colours and ellipses related to the sample subtypes indicate the discriminative power of each component to separate the different tumour subtypes. For the first component, the centroids of each subtype are distinct but there is a moderate amount of overlap between each sample group in their confidence ellipses.

The sample plot with the plotIndiv() function projects each sample into the space spanned by the components of each block (Figure 4). Clustering of the samples can be better assessed with this plot. It seems that the quality of clustering of the mRNA data is the highest with the miRNA clustering quality being the lowest. This suggests that the mRNA is likely to hold more discriminative power within the model.

plotIndiv(final.diablo.model, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO Sample Plots')

plot of chunk unnamed-chunk-18

FIGURE 4: Sample plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are plotted according to their scores on the first 2 components for each data set. Samples are coloured by cancer subtype

In the arrow plot below (Figure), the start of the arrow indicates the centroid between all data sets for a given sample and the tips of the arrows indicate the location of that sample in each block. Such graphics highlight the agreement between all data sets at the sample level. While somewhat difficult to interpret, even qualitatively, Figure 5 shows that the agreement within the LumA group seems to be the highest and lowest in the Her2 group.

plotArrow(final.diablo.model, ind.names = FALSE, legend = TRUE, 
          title = 'DIABLO')

plot of chunk unnamed-chunk-19

FIGURE 5: Arrow plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are projected into the space spanned by the first two components for each data set then overlaid across data sets.

Variable plots

Several graphical outputs are available to visualise and mine the associations between the selected variables.

The best starting point to evaluate the correlation structure between variables is with the correlation circle plot, depicted in Figure 6. A majority of the miRNA variables are positively correlated with the first component while the mRNA variables seem to separate along this dimension. These first two components correlate highly with the selected variables from the proteomics dataset. From this, the correlation of each selected feature from all three datasets can be evaluated based on their proximity. s

plotVar(final.diablo.model, var.names = FALSE, 
        style = 'graphics', legend = TRUE,
        pch = c(16, 17, 15), cex = c(2,2,2), 
        col = c('darkorchid', 'brown1', 'lightgreen'))

plot of chunk unnamed-chunk-20

FIGURE 6: Correlation circle plot from multiblock sPLS-DA performed on the breast.TCGA study. Variable types are indicated with different symbols and colours, and are overlaid on the same plot.

The circos plot is exclusive to integrative frameworks and represents the correlations between variables of different types, represented on the side quadrants. From Figure 7, it seems that the miRNA variables are almost entirely negatively correlated with the other two dataframes. The proteomics features are the opposite, such that they display primarily positive correlations while the mRNA variables are more mixed. Note that these correlations are above a value of 0.7 (cutoff = 0.7). All the interpretations made above are only relevant for features with very strong correlations.

circosPlot(final.diablo.model, cutoff = 0.7, line = TRUE,
           color.blocks= c('darkorchid', 'brown1', 'lightgreen'),
           color.cor = c("chocolate3","grey20"), size.labels = 1.5)

plot of chunk unnamed-chunk-21

FIGURE 7: Circos plot from multiblock sPLS-DA performed on the breast.TCGA study. The plot represents the correlations greater than 0.7 between variables of different types, represented on the side quadrants

Another visualisation of the correlations between the different types of variables is the relevance network, which is also built on the similarity matrix (as is the circos plot). Each colour represents a type of variable. Figure 8 shows this network which has a lower cutoff an Figure 7 (cutoff = 0.4). Two distinct clusters can be observed, though due to the density of the plot the relationships within the cluster are hard to determine. The interactive version of this plot would be useful here.

network(final.diablo.model, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)

plot of chunk unnamed-chunk-22

FIGURE 8: Relevance network for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. Each node represents a selected with colours indicating their type. The colour of the edges represent positive or negative correlations

The network can be saved in a .gml format to be input into the software Cytoscape, using the R package igraph. An example is shown directly below.

my.network = network(final.diablo.model, blocks = c(1,2,3),
        color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
write.graph(my.network$gR, file = "myNetwork.gml", format = "gml")

The function plotLoadings() visualises the loading weights of each selected variable on each component and each data set (Figure 9). The colour indicates the class in which the variable has the maximum level of expression (contrib = 'max') using the median (method = 'median'). Figure 9 depicts the loading values for the second dimension.

plotLoadings(final.diablo.model, comp = 2, contrib = 'max', method = 'median')

plot of chunk unnamed-chunk-24

FIGURE 9: Loading plot for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. The most important variables (according to the absolute value of their coefficients) are ordered from bottom to top. As this is a supervised analysis, colours indicate the class for which the median expression value is the highest for each feature

The cimDIABLO() function is a clustered image map specifically implemented to represent the multi-omics molecular signature expression for each sample. From Figure 10, the areas of homogeneous expression levels for a set of samples across a set of features can be determined. For instance, the Her2 samples were the only group to show extremely high levels of expression for a specific set of proteins (seen by the small, red block in the middle bottom of Figure 10). This indicates these features are fairly discriminating for this subtype.

## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo

plot of chunk unnamed-chunk-25

FIGURE 10: Clustered Image Map for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. By default, Euclidean distance and Complete linkage methods are used. The CIM represents samples in rows (indicated by their breast cancer subtype on the left hand side of the plot) and selected features in columns (indicated by their data type at the top of the plot).

Performance of the model

We assess the performance of the model using 10-fold cross-validation repeated 10 times, using the function perf(). The method runs a block.splsda() model on the pre-specified arguments input from the final.diablo.model object, but on cross-validated samples. We then assess the accuracy of the prediction on the left-out samples.

In addition to the usual (balanced) classification error rates, predicted dummy variables and variates, as well as the stability of the selected features, the perf() function for DIABLO outputs the performance based on Majority Vote (each data set votes for a class for a particular test sample) or a weighted vote, where the weight is defined according to the correlation between the latent component associated to a particular data set and the outcome.

Since the tune() function was used with the centroid.dist argument, the outputs of the perf() function for that same distance are examined. The following code may take a few minutes to run.

# run repeated CV performance evaluation
perf.diablo = perf(final.diablo.model, validation = 'Mfold', 
                   M = 10, nrepeat = 10, 
                   dist = 'centroids.dist') 

## $centroids.dist
##                  comp1      comp2
## Basal       0.02666667 0.03777778
## Her2        0.16666667 0.14666667
## LumA        0.07333333 0.01600000
## Overall.ER  0.07800000 0.04866667
## Overall.BER 0.08888889 0.06681481
## $centroids.dist
##                   comp1      comp2
## Basal       0.004444444 0.03777778
## Her2        0.100000000 0.10333333
## LumA        0.073333333 0.01600000
## Overall.ER  0.058000000 0.04000000
## Overall.BER 0.059259259 0.05237037

From the above output, it can be seen that the error rate across the board is quite low, indicating the constructed DIABLO model does a fairly good job of classifying novel samples.

An AUC plot per block can also be obtained using the function auroc(). The interpretation of this output may not be particularly insightful in relation to the performance evaluation of our methods, but can complement the statistical analysis..

auc.splsda = auroc(final.diablo.model, roc.block = "miRNA", 
                   roc.comp = 2, print = FALSE)

plot of chunk unnamed-chunk-27

FIGURE 11: ROC and AUC based on multiblock sPLS-DA performed on the breast.TCGA study for the miRNA data set after 2 components. The function calculates the ROC curve and AUC for one class vs. the others.

Prediction on an external test set

The predict() function predicts the class of samples from a test set. In our specific case, one data set is missing in the test set but the method can still be applied. Make sure the name of the blocks correspond exactly.

data.test.TCGA = list(mRNA = breast.TCGA$data.test$mrna,
                      miRNA = breast.TCGA$data.test$mirna)

predict.diablo = predict(final.diablo.model, newdata = data.test.TCGA)

The confusion table compares the real subtypes with the predicted subtypes for a 2-component model, for the distance of interest. This model performs quite well as it makes only two errors:

confusion.mat = get.confusion_matrix(truth = breast.TCGA$data.test$subtype,
                     predicted = predict.diablo$WeightedVote$centroids.dist[,2])
##       predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal                 20                 1                 0
## Her2                   0                14                 0
## LumA                   0                 1                34

These two errors correspond to a very low balanced error rate.

## [1] 0.02539683

More information on Plots

For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages:


  1. Sørlie T, Perou CM, Tibshirani R, Aas T, Geisler S, Johnsen H, Hastie T, Eisen MB, Van De Rijn M, Jeffrey SS, et al. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proceedings of the National Academy of Sciences, 98(19):10869–10874, 2001.

  2. Cancer Genome Atlas Network et al. Comprehensive molecular portraits of human breast tumours. Nature, 490(7418):61–70, 2012.

  3. Gonzalez, I, Le Cao, K.A., Davis, M.J. and Dejean, S., 2012. Visualising associations between paired ‘omics’ data sets. BioData mining, 5(1)

  4. Singh, A, Gautier, B, Shannon, CP, Vacher, M, Rohart, F, Tebutt, SJ. and Le Cao, KA, 2016. DIABLO-an integrative, multi-omics, multivariate method for multi-group classification. bioRxiv, p.067611.

  5. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration.

  6. Rohart F, Mason EA, Matigian N, Mosbergen R, Korn O, Chen T, Butcher S, Patel J, Atkinson K, Khosrotehrani K, Fisk NM, Lê Cao K-A&, Wells CA& (2016). A Molecular Classification of Human Mesenchymal Stromal Cells, PeerJ 4:e1845