The aim of this N-integration analysis is to identify a highly correlated multi-omics signature that discriminates the breast cancer subtypes Basal, Her2 and LumA.
Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.
library(mixOmics) # import the mixOmics library
set.seed(123) # for reproducibility, remove for normal use
Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumours can be classified into several subtypes, according to levels of mRNA expression (Sørlie et al., 2001 [1]). Here we consider a subset of data generated by The Cancer Genome Atlas Network (Network et al., 2012 [2]). Data were normalized and drastically pre-filtered for illustrative purposes. The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set includes 70 samples, but only from the mRNA, miRNA and methylation data (proteomics missing).
The mixOmics
TCGA dataset is accessed via `breast.TCGA’
and contains the following:
breast.TCGA$data.train$mirna
(continuous matrix): 150
rows and 184 columns. The expression levels of 184 different sections of
miRNA.breast.TCGA$data.train$mrna
(continuous matrix): 150
rows and 200 columns. The expression levels of 200 different sections of
mRNA.breast.TCGA$data.train$protein
(continuous matrix): 150
rows and 142 columns. The abundance of 142 different proteinsbreast.TCGA$data.train$subtype
(categorical vector):
length of 150. Indicates the breast cancer subtype of each subject.
Includes Basal
, Her2
and
LumA
.To confirm the correct dataframes were extracted, the dimensions are
checked. The distribution of class labels is also examined. It can be
seen that these class labels are not balanced. Note that the dataframes
in the data
list should be named.
data(breast.TCGA) # load in the data
# set a list of all the X dataframes
data = list(miRNA = breast.TCGA$data.train$mirna,
mRNA = breast.TCGA$data.train$mrna,
proteomics = breast.TCGA$data.train$protein)
lapply(data, dim) # check their dimensions
## $miRNA
## [1] 150 184
##
## $mRNA
## [1] 150 200
##
## $proteomics
## [1] 150 142
Y = breast.TCGA$data.train$subtype # set the response variable as the Y df
summary(Y)
## Basal Her2 LumA
## 45 30 75
As mentioned in the methods pages, it is strongly advised that prior to using the DIABLO framework that the data be examinsed in a non-integrative context. Here, the correlation between the top 25 features of each dataframe (in a pairwise fashion) are shown in Figure 1.
list.keepX = c(25, 25) # select arbitrary values of features to keep
list.keepY = c(25, 25)
# generate three pairwise PLS models
pls1 <- spls(data[["miRNA"]], data[["mRNA"]],
keepX = list.keepX, keepY = list.keepY)
pls2 <- spls(data[["miRNA"]], data[["proteomics"]],
keepX = list.keepX, keepY = list.keepY)
pls3 <- spls(data[["mRNA"]], data[["proteomics"]],
keepX = list.keepX, keepY = list.keepY)
# plot features of first PLS
plotVar(pls1, cutoff = 0.5, title = "(a) miRNA vs mRNA",
legend = c("miRNA", "mRNA"),
var.names = FALSE, style = 'graphics',
pch = c(16, 17), cex = c(2,2),
col = c('darkorchid', 'lightgreen'))
# plot features of second PLS
plotVar(pls2, cutoff = 0.5, title = "(b) miRNA vs proteomics",
legend = c("miRNA", "proteomics"),
var.names = FALSE, style = 'graphics',
pch = c(16, 17), cex = c(2,2),
col = c('darkorchid', 'lightgreen'))
# plot features of third PLS
plotVar(pls3, cutoff = 0.5, title = "(c) mRNA vs proteomics",
legend = c("mRNA", "proteomics"),
var.names = FALSE, style = 'graphics',
pch = c(16, 17), cex = c(2,2),
col = c('darkorchid', 'lightgreen'))
FIGURE 1: Circle Correlation Plots for pairwise PLS models on the breast TCGA data. Only displays the top 25 features for each dimension, subsetting by those with a correlation above 0.5.
The strong correlation between a large portion of the miRNA and mRNA features along the first component and to a slightly lesser extent on the second component is depicted in (a). Some of the features in (b) have very strong clustering while other do not, suggesting that some features of the mRNA and protein dataframes are highly correlated. The mRNA and protein features in (c) are the most weakly clustered but are the most strongly correlated along the first component specifically.
Below is are correlations between the first component of each dataset for all three PLS models. It can be seen the values are around ~0.8 - 0.9.
# calculate correlation of miRNA and mRNA
cor(pls1$variates$X, pls1$variates$Y)
## comp1
## comp1 0.8841792
# calculate correlation of miRNA and proteins
cor(pls2$variates$X, pls2$variates$Y)
## comp1
## comp1 0.8361993
# calculate correlation of mRNA and proteins
cor(pls3$variates$X, pls3$variates$Y)
## comp1
## comp1 0.9360264
The moderate to high correlation between these features means that an appropriate value for the design matrix would be about ~0.8 - 0.9. However, as discussed in the N-Integration Methods page, values above 0.5 will cause a reduction in predictive ability of the model - and prediction is what is desired in this context. Hence, a value of 0.1 will be used to prioritise the discriminative ability of the model.
# for square matrix filled with 0.1s
design = matrix(0.1, ncol = length(data), nrow = length(data),
dimnames = list(names(data), names(data)))
diag(design) = 0 # set diagonal to 0s
design
## miRNA mRNA proteomics
## miRNA 0.0 0.1 0.1
## mRNA 0.1 0.0 0.1
## proteomics 0.1 0.1 0.0
With a design in place, the initial DIABLO model can be generated. An
arbitrarily high number of components (ncomp = 5
) will be
used.
# form basic DIABLO model
basic.diablo.model = block.splsda(X = data, Y = Y, ncomp = 5, design = design)
To choose the number of components for the final DIABLO model, the
function perf()
is run with 10-fold cross-validation
repeated 10 times.
# run component number tuning with repeated CV
perf.diablo = perf(basic.diablo.model, validation = 'Mfold',
folds = 10, nrepeat = 10)
plot(perf.diablo) # plot output of tuning
FIGURE 2: Choosing the number of components in block.plsda
using perf()
with 10 × 10-fold CV function in the
breast.TCGA
study. Classification error rates (overall and
balanced, see Section 7.3) are represented on the y-axis with respect to
the number of components on the x-axis for each prediction distance
presented in PLS-DA
From the performance plot above (Figure 2) we observe that both
overall and balanced error rate (BER) decrease from 1 to 2 components.
The standard deviation indicates a potential slight gain in adding more
components. The centroids.dist
distance seems to give the
best accuracy (see Supplemental Material in [6]). Considering this
distance and the BER, the output $choice.ncomp
indicates
the optimal number of components for the final DIABLO model.
# set the optimal ncomp value
ncomp = perf.diablo$choice.ncomp$WeightedVote["Overall.BER", "centroids.dist"]
# show the optimal choice for ncomp for each dist metric
perf.diablo$choice.ncomp$WeightedVote
## max.dist centroids.dist mahalanobis.dist
## Overall.ER 3 2 3
## Overall.BER 3 2 3
This tuning function should be used to tune the keepX
parameters in the block.splsda()
function.
We choose the optimal number of variables to select in each data set
using the tune.block.splsda()
function, for a grid of
keepX
values for each type of omics. Note that the function
has been set to favour a relatively small signature while allowing us to
obtain a sufficient number of variables for downstream validation and/or
interpretation. See ?tune.block.splsda
.
The function tune
is run with 10-fold cross validation,
but repeated only once. Note that for a more thorough tuning process,
provided sufficient computational time, we could increase the
nrepeat
argument. Here we have saved the results into an
RData object that is available for download as the tuning can take a
very long time, especially on lower end machines.
# set grid of values for each component to test
test.keepX = list (mRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)),
miRNA = c(5:9, seq(10, 18, 2), seq(20,30,5)),
proteomics = c(5:9, seq(10, 18, 2), seq(20,30,5)))
# run the feature selection tuning
tune.TCGA = tune.block.splsda(X = data, Y = Y, ncomp = ncomp,
test.keepX = test.keepX, design = design,
validation = 'Mfold', folds = 10, nrepeat = 1,
dist = "centroids.dist")
The number of features to select on each component is returned in
tune.TCGA$choice.keepX
.
list.keepX = tune.TCGA$choice.keepX # set the optimal values of features to retain
list.keepX
## $miRNA
## [1] 10 5
##
## $mRNA
## [1] 25 16
##
## $proteomics
## [1] 8 5
The final DIABLO model is run as:
# set the optimised DIABLO model
final.diablo.model = block.splsda(X = data, Y = Y, ncomp = ncomp,
keepX = list.keepX, design = design)
The warning message informs us that the outcome y has been included automatically in the design, so that the covariance between each block’s component and the outcome is maximised, as shown in the final design output.
final.diablo.model$design # design matrix for the final model
## miRNA mRNA proteomics Y
## miRNA 0.0 0.1 0.1 1
## mRNA 0.1 0.0 0.1 1
## proteomics 0.1 0.1 0.0 1
## Y 1.0 1.0 1.0 0
The selected variables can be extracted with the function
selectVar()
, for example in the mRNA block, as seen below.
Note that the stability of selected variables can be extracted from the
output of the perf()
function.
# the features selected to form the first component
selectVar(final.diablo.model, block = 'mRNA', comp = 1)$mRNA$name
## [1] "ZNF552" "KDM4B" "CCNA2" "LRIG1" "PREX1" "FUT8" "C4orf34"
## [8] "TTC39A" "ASPM" "SLC43A3" "MEX3A" "SEMA3C" "E2F1" "STC2"
## [15] "FMNL2" "LMO4" "MED13L" "DTWD2" "CSRP2" "NTN4" "KIF13B"
## [22] "NCAPG2" "SLC19A2" "EPHB3" "FAM63A"
plotDIABLO()
is a diagnostic plot to check whether the
correlation between components from each data set has been maximised as
specified in the design matrix. We specify which dimension to be
assessed with the ncomp
argument.
plotDiablo(final.diablo.model, ncomp = 1)
FIGURE 3: Diagnostic plot from multiblock sPLS-DA applied on the
breast.TCGA
study. Samples are represented based on the
specified component (here ncomp = 1
) for each data set
(mRNA, miRNA and protein). Samples are coloured by breast cancer subtype
and 95% confidence ellipse plots are represented.
As can be seen in Figure 3, the first components from each data set are highly correlated to each other (indicated by the large numbers in the bottom left). The colours and ellipses related to the sample subtypes indicate the discriminative power of each component to separate the different tumour subtypes. For the first component, the centroids of each subtype are distinct but there is a moderate amount of overlap between each sample group in their confidence ellipses.
The sample plot with the plotIndiv()
function projects
each sample into the space spanned by the components of each block
(Figure 4). Clustering of the samples can be better assessed with this
plot. It seems that the quality of clustering of the mRNA data is the
highest with the miRNA clustering quality being the lowest. This
suggests that the mRNA is likely to hold more discriminative power
within the model.
plotIndiv(final.diablo.model, ind.names = FALSE, legend = TRUE,
title = 'DIABLO Sample Plots')
FIGURE 4: Sample plot from multiblock sPLS-DA performed on the
breast.TCGA
study. The samples are plotted according to
their scores on the first 2 components for each data set. Samples are
coloured by cancer subtype
In the arrow plot below (Figure), the start of the arrow indicates
the centroid between all data sets for a given sample and the tips of
the arrows indicate the location of that sample in each block. Such
graphics highlight the agreement between all data sets at the sample
level. While somewhat difficult to interpret, even qualitatively, Figure
5 shows that the agreement within the LumA
group seems to
be the highest and lowest in the Her2
group.
plotArrow(final.diablo.model, ind.names = FALSE, legend = TRUE,
title = 'DIABLO')
FIGURE 5: Arrow plot from multiblock sPLS-DA performed on the
breast.TCGA
study. The samples are projected into the space
spanned by the first two components for each data set then overlaid
across data sets.
Several graphical outputs are available to visualise and mine the associations between the selected variables.
The best starting point to evaluate the correlation structure between variables is with the correlation circle plot, depicted in Figure 6. A majority of the miRNA variables are positively correlated with the first component while the mRNA variables seem to separate along this dimension. These first two components correlate highly with the selected variables from the proteomics dataset. From this, the correlation of each selected feature from all three datasets can be evaluated based on their proximity. s
plotVar(final.diablo.model, var.names = FALSE,
style = 'graphics', legend = TRUE,
pch = c(16, 17, 15), cex = c(2,2,2),
col = c('darkorchid', 'brown1', 'lightgreen'))
FIGURE 6: Correlation circle plot from multiblock sPLS-DA performed on
the breast.TCGA
study. Variable types are indicated with
different symbols and colours, and are overlaid on the same plot.
The circos plot is exclusive to integrative frameworks and represents
the correlations between variables of different types, represented on
the side quadrants. From Figure 7, it seems that the miRNA variables are
almost entirely negatively correlated with the other two dataframes. The
proteomics features are the opposite, such that they display primarily
positive correlations while the mRNA variables are more mixed. Note that
these correlations are above a value of 0.7 (cutoff = 0.7
).
All the interpretations made above are only relevant for features with
very strong correlations.
circosPlot(final.diablo.model, cutoff = 0.7, line = TRUE,
color.blocks= c('darkorchid', 'brown1', 'lightgreen'),
color.cor = c("chocolate3","grey20"), size.labels = 1.5)
FIGURE 7: Circos plot from multiblock sPLS-DA performed on the
breast.TCGA
study. The plot represents the correlations
greater than 0.7 between variables of different types, represented on
the side quadrants
Another visualisation of the correlations between the different types
of variables is the relevance network, which is also built on the
similarity matrix (as is the circos plot). Each colour represents a type
of variable. Figure 8 shows this network which has a lower cutoff an
Figure 7 (cutoff = 0.4
). Two distinct clusters can be
observed, though due to the density of the plot the relationships within
the cluster are hard to determine. The interactive version of this plot
would be useful here.
network(final.diablo.model, blocks = c(1,2,3),
color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
FIGURE 8: Relevance network for the variables selected by multiblock
sPLS-DA performed on the breast.TCGA
study on component 1.
Each node represents a selected with colours indicating their type. The
colour of the edges represent positive or negative correlations
The network can be saved in a .gml
format to be input
into the software Cytoscape, using the R
package
igraph
. An example is shown directly below.
library(igraph)
my.network = network(final.diablo.model, blocks = c(1,2,3),
color.node = c('darkorchid', 'brown1', 'lightgreen'), cutoff = 0.4)
write.graph(my.network$gR, file = "myNetwork.gml", format = "gml")
The function plotLoadings()
visualises the loading
weights of each selected variable on each component and each data set
(Figure 9). The colour indicates the class in which the variable has the
maximum level of expression (contrib = 'max'
) using the
median (method = 'median'
). Figure 9 depicts the loading
values for the second dimension.
plotLoadings(final.diablo.model, comp = 2, contrib = 'max', method = 'median')
FIGURE 9: Loading plot for the variables selected by multiblock sPLS-DA
performed on the breast.TCGA
study on component 1. The most
important variables (according to the absolute value of their
coefficients) are ordered from bottom to top. As this is a supervised
analysis, colours indicate the class for which the median expression
value is the highest for each feature
The cimDIABLO()
function is a clustered image map
specifically implemented to represent the multi-omics molecular
signature expression for each sample. From Figure 10, the areas of
homogeneous expression levels for a set of samples across a set of
features can be determined. For instance, the Her2
samples
were the only group to show extremely high levels of expression for a
specific set of proteins (seen by the small, red block in the middle
bottom of Figure 10). This indicates these features are fairly
discriminating for this subtype.
cimDiablo(final.diablo.model)
##
## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo
FIGURE 10: Clustered Image Map for the variables selected by multiblock
sPLS-DA performed on the breast.TCGA
study on component 1.
By default, Euclidean distance and Complete linkage methods are used.
The CIM represents samples in rows (indicated by their breast cancer
subtype on the left hand side of the plot) and selected features in
columns (indicated by their data type at the top of the plot).
We assess the performance of the model using 10-fold cross-validation
repeated 10 times, using the function perf()
. The method
runs a block.splsda()
model on the pre-specified arguments
input from the final.diablo.model
object, but on
cross-validated samples. We then assess the accuracy of the prediction
on the left-out samples.
In addition to the usual (balanced) classification error rates,
predicted dummy variables and variates, as well as the stability of the
selected features, the perf()
function for DIABLO outputs
the performance based on Majority Vote (each data set votes for a class
for a particular test sample) or a weighted vote, where the weight is
defined according to the correlation between the latent component
associated to a particular data set and the outcome.
Since the tune()
function was used with the
centroid.dist
argument, the outputs of the
perf()
function for that same distance are examined. The
following code may take a few minutes to run.
# run repeated CV performance evaluation
perf.diablo = perf(final.diablo.model, validation = 'Mfold',
M = 10, nrepeat = 10,
dist = 'centroids.dist')
perf.diablo$MajorityVote.error.rate
## $centroids.dist
## comp1 comp2
## Basal 0.03111111 0.04444444
## Her2 0.16000000 0.16666667
## LumA 0.06666667 0.01733333
## Overall.ER 0.07466667 0.05533333
## Overall.BER 0.08592593 0.07614815
perf.diablo$WeightedVote.error.rate
## $centroids.dist
## comp1 comp2
## Basal 0.01111111 0.04444444
## Her2 0.09333333 0.13333333
## LumA 0.06666667 0.01733333
## Overall.ER 0.05533333 0.04866667
## Overall.BER 0.05703704 0.06503704
From the above output, it can be seen that the error rate across the board is quite low, indicating the constructed DIABLO model does a fairly good job of classifying novel samples.
An AUC plot per block can also be obtained using the function
auroc()
. The interpretation of this output may not be
particularly insightful in relation to the performance evaluation of our
methods, but can complement the statistical analysis..
auc.splsda = auroc(final.diablo.model, roc.block = "miRNA",
roc.comp = 2, print = FALSE)
FIGURE 11: ROC and AUC based on multiblock sPLS-DA performed on the
breast.TCGA
study for the miRNA data set after 2
components. The function calculates the ROC curve and AUC for one class
vs. the others.
The predict()
function predicts the class of samples
from a test set. In our specific case, one data set is missing in the
test set but the method can still be applied. Make sure the name of the
blocks correspond exactly.
data.test.TCGA = list(mRNA = breast.TCGA$data.test$mrna,
miRNA = breast.TCGA$data.test$mirna)
predict.diablo = predict(final.diablo.model, newdata = data.test.TCGA)
The confusion table compares the real subtypes with the predicted subtypes for a 2-component model, for the distance of interest. This model performs quite well as it makes only two errors:
confusion.mat = get.confusion_matrix(truth = breast.TCGA$data.test$subtype,
predicted = predict.diablo$WeightedVote$centroids.dist[,2])
confusion.mat
## predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal 20 1 0
## Her2 0 14 0
## LumA 0 1 34
These two errors correspond to a very low balanced error rate.
get.BER(confusion.mat)
## [1] 0.02539683