Canonical Correlation Analysis (CCA) and its regularised counterpart (rCCA) are multivariate, integrative methods used to analyse the correlation structure between two datasets. While similar to PLS, it looks to maximise the correlation between canonical variates in order to reduce dimensionality.
Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.
library(mixOmics) # import the mixOmics library
set.seed(5249) # for reproducibility, remove for normal use
The nutrimouse data contains the expression measure of 120 genes potentially involved in nutritional problems and the concentrations of 21 hepatic fatty acids for forty mice (Martin et al., 2007).
The mixOmics
nutrimouse dataset is accessed via
nutrimouse
and contains the following:
nutrimouse$gene
(continuous matrix): 40 rows and 120
columns. Expression levels of 120 genes measured in liver cells,
selected (among about 30,000) as potentially relevant in the context of
the nutrition study.nutrimouse$lipid
(continuous matrix): 40 rows and 21
columns. Concentration (in percentage) of 21 hepatic fatty acids
measured by gas chromatographynutrimouse$genotype
(binary vector): 40 rows. Contains
binary indicator whether the corresponding mouse had the wildtype or
PPAR -/- genotype.nutrimouse$diet
(categorical vector): 40 rows. Oils
used for experimental diets were: corn and colza oils (50/50) for a
reference diet (REF), hydrogenated coconut oil for a saturated fatty
acid diet (COC), sunflower oil for an Omega6 fatty acid-rich diet (SUN),
linseed oil for an Omega3-rich diet (LIN) and corn/colza/enriched fish
oils for the FISH diet (43/43/14).To confirm the correct dataframes were extracted, the dimensions of each are checked.
data(nutrimouse)
X <- nutrimouse$lipid # extract all lipid concentration variables
Y <- nutrimouse$gene # extract all gene expression variables
dim(X) # check the dimensions of the X dataframe
## [1] 40 21
dim(Y) # check the dimensions of the Y dataframe
## [1] 40 120
A good stepping off place is to look at the raw correlations between
the total set of features (both inter-dataset and intra-dataset). The
heatmap in Figure 1 utilises the imgCor()
function. This
image provides a good insight into the correlation structure of each
data set both together but also separately (indicated by the purple
(X) and green (Y) rectangles). The
color key indicates the range of negative (blue) to positive (red)
correlations. For example, the genetic expression data features are
largely positively correlated with one another, while the correlations
between the two datasets are fairly variable in magnitude and
direction.
# produce a heat map of the cross correlation matrix
imgCor(X, Y, sideColors = c("purple", "green"))
FIGURE 1: Cross-correlation matrix in the nutrimouse study between the gene and lipid data sets. Note that the representation is symmetrical around the diagonal, so we only need to interpret the upper or lower side of this plot.
Unlike the other methods in the mixOmics
package, there
is actually no need to form a basic model for tuning. Optimisation (in
the form of regularisation) is done externally using the
tune.rcc()
function or within the rcc()
function.
The regularisation method can be selected based on the dataset sizes
and available time. Using the ridge
method can be long and
costly on large datasets, but takes into account the cross-correlations
between the datasets. This method requires the usage of the
tune.rcc()
function to optimise \(\lambda1\) and \(\lambda2\).
Utilising the shrinkage
method will save time and is
implemented within the rcc()
function. This is much more
appropriate for large-scale datasets, with the drawback that the \(\lambda1\) and \(\lambda2\) will be optimised independently
of one another.
This method utilises a grid search functionality to optimise \(\lambda1\) and \(\lambda2\) through the
tune.rcc()
function. By default, the grid is quite coarse
([0.001, 1] split evenly into five values for both \(\lambda1\) and \(\lambda2\)). The granularity of the grid
may need to be tweaked by the user.
tune.rcc()
will produce a heatmap that can be used to
select \(\lambda1\) and \(\lambda2\), where the highest
cross-validated score is desired. This can be seen in Figure 2.
# set grid search values for each regularisation parameter
grid1 <- seq(0.001, 0.2, length = 10)
grid2 <- seq(0.001, 0.2, length = 10)
# optimise the regularisation parameter values
cv.tune.rcc.nutrimouse <- tune.rcc(X, Y, grid1 = grid1, grid2 = grid2,
validation = "loo")
FIGURE 2: Heatmap of lambda1 and lambda2 values coloured by the resulting cross-validation score
The opt.lambda1
and opt.lambda2
components
can be used to directly select the best parameter choice from the output
of tune.rcc
, as follows:
cv.tune.rcc.nutrimouse # examine the results of CV tuning
##
## Call:
## tune.rcc(X = X, Y = Y, grid1 = grid1, grid2 = grid2, validation = "loo")
##
## lambda1 = 0.1115556 , see object$opt.lambda1
## lambda2 = 0.02311111 , see object$opt.lambda2
## CV-score = 0.8711438 , see object$opt.score
opt.l1 <- cv.tune.rcc.nutrimouse$opt.lambda1 # extract the optimal lambda values
opt.l2 <- cv.tune.rcc.nutrimouse$opt.lambda2
# formed optimised CV rCCA
CV.rcc.nutrimouse <- rcc(X, Y, method = "ridge",
lambda1 = opt.l1, lambda2 = opt.l2)
This method is integrated within the rcc()
function,
therefore does not require the use of tune.rcc()
. The exact
tuned values of \(\lambda1\), \(\lambda2\) can be extracted through the
$lambda
component of the rcc
object.
# run the rCCA method using shrinkage
shrink.rcc.nutrimouse <- rcc(X,Y, method = 'shrinkage')
# examine the optimal lambda values after shrinkage
shrink.rcc.nutrimouse$lambda
## lambda1 lambda2
## 0.1390342 0.1359755
The maximum number of producible pairs of canonical variates is \(min(P, Q)\), where \(P\) and \(Q\) are the number of features in the X and Y datasets respectively. Canonical correlations for each pair of canonical variates can be plotted using a barplot. A steep drop off in canonical correlation indicates less useful pairs of canonical variates (ie. the ‘elbow’ method).
Barplots for each regularisation method is shown below in Figure 3. While less explicit in the shrinkage method plot, a choice of three dimensions would be appropriate in each case.
# barplot of cross validation method rCCA canonical correlations
plot(CV.rcc.nutrimouse, type = "barplot", main = "Cross Validation")
# barplot of shrinkage method rCCA canonical correlations
plot(shrink.rcc.nutrimouse, type = "barplot", main = "Shrinkage")
FIGURE 3: Barplots showing canonical correlation values for each novel dimension. Left diagram depicts values using the cross-valiation method, right diagram depicts values using shrinkage method
The two models generated above are both adequate. However, in the case of the ridge method it is recommended to repeat the tuning step multiple times, increasing the resolution of tested lambda values to reach an optimum.
rCCA is an unsupervised approach that focuses on maximising the
correlation between the two data sets X and
Y and therefore the information about the treatments or
groups is not taken into account in the analysis. In Figure 4, the
different treatments are coloured to illustrate how the samples are
clustered on the sample plot. Both methods’ first variate seem to
separate the genotype quite well (indicated by wt
and
ppar
). The shrinkage method seems better at characterising
the coc
diet, but not the other types of diets. It is hard
to determine qualitatively which of these plots is “better” - i.e. which
method would be better to use on this data.
# plot the projection of samples for CV rCCA data
plotIndiv(CV.rcc.nutrimouse, comp = 1:2,
ind.names = nutrimouse$genotype,
group = nutrimouse$diet, rep.space = "XY-variate",
legend = TRUE, title = '(a) Nutrimouse, rCCA CV XY-space')
# plot the projection of samples for shrinkage rCCA data
plotIndiv(shrink.rcc.nutrimouse, comp = 1:2,
ind.names = nutrimouse$genotype,
group = nutrimouse$diet, rep.space = "XY-variate",
legend = TRUE, title = '(b) Nutrimouse, rCCA shrinkage XY-space')
FIGURE 4: rCCA sample plots from the CV (a) or shrinkage (b) method.
Canonical variates corresponding to each data set are first averaged
using the argument rep.space = 'XY-variate'
. Samples are
projected into the space spanned by the averaged canonical variates and
coloured according to genotype information.
The next sample plot to utilise is the arrow plot. From this, the relation between each sample’s projection into each dataset’s variate space can be evaluated. The start of each arrow represents the sample’s position in the space spanned by the X variates while the tip represents its position in the Y variate space. The longer each arrow, the higher the disparity between the datasets. The shrinkage method, shown in Figure 5(b), yields variates which produce much more homogeneous sample projections, depicted by the considerably shorter average arrow length when compared with Figure 5(a). From this figure alone, it would seem that the shrinkage method is the more appropriate choice in this context.
# plot the arrow plot of samples for CV rCCA data
plotArrow(CV.rcc.nutrimouse, group = nutrimouse$diet,
col.per.group = color.mixo(1:5),
title = '(a) Nutrimouse, CV method')
# plot the arrow plot of samples for shrinkage rCCA data
plotArrow(shrink.rcc.nutrimouse, group = nutrimouse$diet,
col.per.group = color.mixo(1:5),
title = '(b) Nutrimouse, shrinkage method')
FIGURE 5: Arrow sample plots from the rCCA performed on the nutrimouse data to represent the samples projected onto the first two canonical variates. Each arrow represents one sample.
Another important step is analysing the correlation structure between
features and the canonical variates. Figure 6(a) and (b) depict the
correlation circle plots for the CV and shrinkage methods respectively.
Each use a cutoff
of 0.5 to prevent features with a
negligible correlation being depicted. Figure 6 shows that the main
correlations between genes and lipids appear on the first dimension, for
example: GSTPi2
with C20.5n.3
and
C22.6n.3
in the top right corner; THIOL
and
C.16.0
in the far right and CAR1
,
ACOTH
and C.20.1n.9
in the far left of both
plots. Differences in interpretation appear in the second dimension
where clustering differs much more. Determining which method is
appropriate is based on the biological question. For instance, if it is
equally important to separate all diets, then the shrinkage method may
be more useful.
plotVar(CV.rcc.nutrimouse, var.names = c(TRUE, TRUE),
cex = c(4, 4), cutoff = 0.5,
title = '(a) Nutrimouse, rCCA CV comp 1 - 2')
plotVar(shrink.rcc.nutrimouse, var.names = c(TRUE, TRUE),
cex = c(4, 4), cutoff = 0.5,
title = '(b) Nutrimouse, rCCA shrinkage comp 1 - 2')
FIGURE 6: Correlation circle plots from the rCCA performed on the nutrimouse data showing the correlation between the gene expression and lipid concentration data.
A more robust way to evaluate the correlation structure is through a relevance network plot, as seen in Figure 7. The circular nodes represent genes and the rectangular nodes represent lipids. The bipartite relationships between these non-negligibly correlated features can be seen by the colour of the connecting lines (edges). Figure 7 shows that there is a reasonably complex correlation structure. However, there seems to be roughly three clusters, each corresponding to a group of three genes, a group of five genes and a single gene. The group of three genes has some negative correlations while the rest of the network seems to be positively correlated.
network(CV.rcc.nutrimouse, comp = 1:2, interactive = FALSE,
lwd.edge = 2,
cutoff = 0.5)
FIGURE 7: Relevance network plot from the rCCA performed on the nutrimouse data. This shows the correlation structure for all bipartite relationships with a correlation above 0.5.
To complement the network plot, a CIM can be used (Figure 8). While Figure 8 does not have a cutoff which aids in determining the number of feature clusters, the correlations seen here do reflect that seen in Figure 7 - such that there are roughly three clusters.
cim(CV.rcc.nutrimouse, comp = 1:2, xlab = "genes", ylab = "lipids")
FIGURE 8: Cluster Image Map from the rCCA performed on the nutrimouse data. This shows the correlation structure for all bipartite relationships