N-Integration discriminant analysis with DIABLO
Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumors can be classified into several sub types, according to levels of mRNA expression (Sørlie et al., 2001 ). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Network et al., 2012 ). Data were normalized and drastically pre-filtered for illustrative purposes. The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set includes 70 samples, but only from the mRNA, miRNA and methylation data (proteomics missing). The aim of our N-integration analysis is to identify a highly correlated multi-‘omics signature discriminating the breast cancer subtypes subgroups Basal, Her2 and LumA.
The full data set for this example (also including methylation data and 4 breast cancer subtypes) can be downloaded here. In the example below illustrate an analysis on a smaller data set that is stored in the mixOmics package.
Load the latest version of mixOmics (check the version!).
The R script is available at this link (R script and RData)
The data for mixDIABLO need to be set up as a list of data matrices matching the same samples. The blocks: are 'omics data sets, where each row matches to the same biological sample from one data set to another. The outcome Y is set as a factor for the supervised analysis. Here Y corresponds to the PAM50 classification.
The data for DIABLO need to be set up as a list of data matrices matching the same samples in rows.
data('breast.TCGA') # extract training data data = list(mRNA = breast.TCGA$data.train$mrna, miRNA = breast.TCGA$data.train$mirna, proteomics = breast.TCGA$data.train$protein) # check dimension lapply(data, dim)
## $mRNA ##  150 200 ## ## $miRNA ##  150 184 ## ## $proteomics ##  150 142
# outcome Y = breast.TCGA$data.train$subtype summary(Y)
## Basal Her2 LumA ## 45 30 75
The matrix design determines which blocks should be connected to maximize the correlation or covariance between components. the values may range between 0 (no correlation) to 1 (correlation to maximize) and is a symmetrical matrix. The design can be chosen based on prior knowledge (‘I expect mRNA and miRNA to be highly correlated’) or data-driven (e.g. based on a prior analysis, such as a non sparse analysis with block.pls to examine the correlation between the different blocks via the modelled components).
Our experience has shown that a compromise between maximising the correlation between blocks, and discriminating the outcome needed to be achieved, and that the weight in the design matrix could be set to < 1 between blocks (see an example in our vignette and information in supplemental material in ).
Here we choose a design where all the blocks are connected with a link of 0.1 (see Supplemental Information S3). We set the number of components to K-1, where K is the number of categories in the Y outcome.
design = matrix(0.1, ncol = length(data), nrow = length(data), dimnames = list(names(data), names(data))) diag(design) = 0 design
## mRNA miRNA proteomics ## mRNA 0.0 0.1 0.1 ## miRNA 0.1 0.0 0.1 ## proteomics 0.1 0.1 0.0
Tuning the number of components
First, we fit a DIABLO model without variable selection to assess the global performance and choose the number of components for the final DIABLO model. The function perf is run with 10-fold cross validation repeated 10 times.
sgccda.res = block.splsda(X = data, Y = Y, ncomp = 5, design = design) set.seed(123) # for reproducibility, only when the `cpus' argument is not used # this code takes a couple of min to run perf.diablo = perf(sgccda.res, validation = 'Mfold', folds = 10, nrepeat = 10) #perf.diablo # lists the different outputs plot(perf.diablo)
From the performance plot above we observe that both overall and balanced error rate (BER) decrease from 1 to 2 components. The standard deviation indicates a potential slight gain in adding more components. The centroids.dist distance seem to give the best accuracy (see Supplemental Material in ). Considering this distance and the BER, the output $choice.ncomp indicates an optimal number of components for the final DIABLO model.
## max.dist centroids.dist mahalanobis.dist ## Overall.ER 5 3 3 ## Overall.BER 3 2 4
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
This tuning function should be used to tune the keepX parameters in the block.splsda function.
We choose the optimal number of variables to select in each data set using the tune function, for a grid of keepX values. Note that the function has been set to favor the small-ish signature while allowing to obtain a sufficient number of variables for downstream validation / interpretation. See ?tune.block.splsda.
We choose the optimal number of variables to select in each data set using the tune.block.splsda function, to which we provide a grid of keepX values for each type of omics. Note that the function has been set to favour the small-ish signature while allowing to obtain a sufficient number of variables for downstream validation / interpretation.
The function tune is run with 10-fold cross validation, but repeated only once. Note that for a more thorough tuning process, provided sufficient computational time, we could increase the nrepeat argument.
Here we have saved the results into a RData object that is available for download (the tuning takes ~ 1h30).
#set.seed(123) # for reproducibility, only when the `cpus' argument is not used test.keepX = list (mRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)), miRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)), proteomics = c(5:9, seq(10, 18, 2), seq(20,30,5))) t1 = proc.time() tune.TCGA = tune.block.splsda(X = data, Y = Y, ncomp = ncomp, test.keepX = test.keepX, design = design, validation = 'Mfold', folds = 10, nrepeat = 1, cpus = 2, dist = "centroids.dist") t2 = proc.time() running_time = t2 - t1; running_time list.keepX = tune.TCGA$choice.keepX list.keepX
The number of features to select on each component is returned in tune.TCGA$choice.keepX. Alternatively, you can manually input those parameters as indicated below.
list.keepX = list(mRNA = c(6,14), miRNA = c(5,18), proteomics = c(6,7)) # from tuning step
The final DIABLO model is run as:
sgccda.res = block.splsda(X = data, Y = Y, ncomp = ncomp, keepX = list.keepX, design = design) #sgccda.res # list the different functions of interest related to that object
The warning message informs that the outcome Y has been included automatically in the design, so that the covariance between each block's component and the outcome is maximised, as shown in the final design output
## mRNA miRNA proteomics Y ## mRNA 0.0 0.1 0.1 1 ## miRNA 0.1 0.0 0.1 1 ## proteomics 0.1 0.1 0.0 1 ## Y 1.0 1.0 1.0 0
The selected variables can be extracted with the function selectVar, for example in the mRNA block:
# mrna variables selected on component 1 selectVar(sgccda.res, block = 'mRNA', comp = 1)$mRNA$name
##  "ZNF552" "KDM4B" "PREX1" "CCNA2" "LRIG1" "FUT8"
plotDIABLO is a diagnostic plot to check whether the correlation between components from each data set has been maximized as specified in the design matrix. We specify which dimension to be assessed with the ncomp argument.
plotDiablo(sgccda.res, ncomp = 1)
The first components from each data set are highly correlated to each other. The colors and ellipses related to the sample subtypes and indicate the discriminative power of each component to separate the different tumour subtypes.
The sample plot with the plotIndiv function projects each sample into the space spanned by the components of each block. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE):
plotIndiv(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')
In the arrow plot below, the start of the arrow indicates the centroid between all data sets for a given sample and the tips of the arrows the location of that sample in each block. Such graphic highlight the agreement between all data sets at the sample level, when modelled with DIABLO.
plotArrow(sgccda.res, ind.names = FALSE, legend = TRUE, title = 'DIABLO')
Several graphical outputs are available to visualize and mine the associations between the selected variables.
The correlation circle plot highlights the contribution of each selected variable to each component, see . plotVar displays the variables from all blocks, selected on component 1 and 2 (see here for more examples). Clusters of points indicate a strong correlation between variables. For better visibility we choose to hide the variable names:
plotVar(sgccda.res, var.names = FALSE, style = 'graphics', legend = TRUE, pch = c(16, 17, 15), cex = c(2,2,2), col = c('darkorchid', 'brown1', 'lightgreen'))
The circos plot represents the correlations between variables of different types, represented on the side quadrants. Several display options are possible, to show within and between connexions between blocks, expression levels of each variable according to each class (argument line = TRUE). The circos plot is built based on a similarity matrix, extended to the case of multiple data sets from .
circosPlot(sgccda.res, cutoff = 0.7, line = TRUE, color.blocks= c('darkorchid', 'brown1', 'lightgreen'), color.cor = c("chocolate3","grey20"), size.labels = 1.5)