Using kernel transformations allows for linearly inseparable data to
be converted to a space where it can be linearly separable. The
mixKernel
package provides methods that combine kernels to
improve the unsupervised exploration of a set of data. This vignette
aims to undergo an unsupervised analysis on the Tara Ocean data
utilising the Kernel PCA method.
Load the latest version of mixOmics
and
mixKernel
.
library(mixOmics) # import the mixOmics library
library(mixKernel) # import the mixKernel library
The TARA Oceans expedition facilitated the study of plankton communities by providing oceans metagenomic data combined with environmental measures to the scientific community. This study focuses on 139 prokaryotic-enriched samples collected from 68 stations and spread across three depth layers. Samples were located in 8 different oceans or seas.
This case study only considers a subset of this data (published by Mariette & Villa-Vialaneix (INRA Toulouse, France) [1]). The data include 1% of the 35,650 prokaryotic OTUs and of the 39,246 bacterial genes that were randomly selected.
The mixKernel
Tara Ocean dataset is accessed via
TARAoceans
and contains the following:
TARAoceans$phychem
(continuous matrix): 139 rows and 22
columns. Each row represents a sample and each column an environmental
variable.TARAoceans$pro.phylo
(continuous matrix): 139 rows and
356 columns. Contains information on the prokaryotic OTUs (Operational
Taxonomic Unit).TARAoceans$taxonomy
(categorical matrix): 356 rows and
6 columns. Indicates the taxonomy of each OTU represented in the
pro.phylo
dataframe.TARAoceans$phylogenetic.tree
(phylo
object): represents the prokaryotic OTU phylogenetic tree (see the
ape
package).TARAoceans$pro.NOGs
(categorical matrix): 139 rows and
638 columns. Contains NOG data.TARAoceans$sample
(categorical vector): a list
containing three following entries (all three are character vectors each
of legnth 139): name
(sample name), ocean
(oceanic region of the sample) and depth
(sample
depth)For each inputted dataset, a kernel is computed using the
compute.kernel()
function. It can be constructed in a
linear ("linear"
), phylogenic ("phylogenetic"
)
or abundance ("abundance"
) manner. The
kernel.func
parameter can also be used to pass in a
specific, user defined kernel. After applying this function, the
resulting object has a kernel
component which contains the
kernel matrix - a symmetric matrix with a size equal to the number of
samples in the input dataset.
phychem.kernel = compute.kernel(TARAoceans$phychem,
kernel.func = "linear")
pro.phylo.kernel = compute.kernel(TARAoceans$pro.phylo,
kernel.func = "abundance")
pro.NOGs.kernel = compute.kernel(TARAoceans$pro.NOGs,
kernel.func = "abundance")
The cim.kernel()
function allows for a general overview
of the correlation structure between datasets. From Figure 1, the
pro.phylo
and pro.NOGs
datasets are the most
highly (and positively) correlated. This result is expected as both
kernels provide a summary of prokaryotic communities.
cim.kernel(phychem = phychem.kernel,
pro.phylo = pro.phylo.kernel,
pro.NOGs = pro.NOGs.kernel,
method = "square")
FIGURE 1: Simple Clustered Image Map depicting the correlation structure between the three inputted datasets. The colour uses the legend (on the right) to denote the correlation strength.
The function combine.kernels()
implements 3 different
methods for combining kernels: STATIS-UMKL
,
sparse-UMKL
and full-UMKL
(see more details in
Mariette and Villa-Vialaneix, 2017). It returns a meta-kernel that can
be used as an input for the function kernel.pca()
. The
three methods bring complementary information and must be chosen
according to the research question.
The STATIS-UMKL
approach gives an overview on the common
information between the different datasets. The full-UMKL
computes a kernel that minimizes the distortion between all input
kernels. sparse-UMKL
is a sparse variant of full-UMKL but
also selects the most relevant kernels.
meta.kernel = combine.kernels(phychem = phychem.kernel,
pro.phylo = pro.phylo.kernel,
pro.NOGs = pro.NOGs.kernel,
method = "full-UMKL")
KPCA utilises the generated kernels to improve the performance of the
standard PCA algorithm. Data that otherwise could not have its
dimensions reduced by PCA can under dimension reduction via KPCA. A KPCA
is applied here on the combined kernel generated above vua the
kernel.pca()
function. The ncomp
parameter
controls the number of novel components to extract.
kernel.pca.result = kernel.pca(meta.kernel, ncomp = 10)
The projection of the samples onto the new components (within the kernel) is shown below in Figure 2.
all.depths = levels(factor(TARAoceans$sample$depth))
depth.pch = c(20, 17, 4, 3)[match(TARAoceans$sample$depth, all.depths)]
plotIndiv(kernel.pca.result,
comp = c(1, 2),
ind.names = FALSE,
legend = TRUE,
group = as.vector(TARAoceans$sample$ocean),
col.per.group = c("#f99943", "#44a7c4", "#05b052", "#2f6395",
"#bb5352", "#87c242", "#07080a", "#92bbdb"),
pch = depth.pch,
pch.levels = TARAoceans$sample$depth,
legend.title = "Ocean / Sea",
title = "Projection of TARA Oceans stations",
size.title = 10,
legend.title.pch = "Depth")
FIGURE 2: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
It can be seen in Figure 3 that the first component of this kernel explains nearly 20% of the data’s variance.
plot(kernel.pca.result)
FIGURE 3: Explained variance of each principal component of a KPCA run over the meta kernel of the three inputted dataframes.
Here we focus on the information summarised on the first component.
Variables values are randomly permuted with the function
permute.kernel.pca()
.
In the following example, physical variables are permuted at the
variable level (kernel phychem
), OTU abundances from
pro.phylo
kernel are permuted at the phylum level (OTU
phyla are stored in the second column, named Phylum
, of the
taxonomy annotation provided in TARAoceans
object in the
entry taxonomy
) and gene abundances from
pro.NOGs
are permuted at the GO level (GO are provided in
the entry TARAoceans$GO
of the dataset).
The kernel.pca.permute()
function determines the
importance of a given variable by computing the Crone-Crosby distance
[2] between the original vector of values and a vector of these values
that have been permuted randomly.
# here we set a seed for reproducible results with this tutorial
set.seed(17051753)
kernel.pca.result = kernel.pca.permute(kernel.pca.result, ncomp = 1,
phychem = colnames(TARAoceans$phychem),
pro.phylo = TARAoceans$taxonomy[ ,"Phylum"],
pro.NOGs = TARAoceans$GO)
Results are displayed with the function
plotVar.kernel.pca()
. The argument ndisplay
indicates the number of variables to display for each kernel. For each
of the three dataframes, the features which contribute to first
component (ncomp = 1
in the
kernel.pca.permute()
function) to the greatest degree are
shown (Figure 4).
plotVar.kernel.pca(kernel.pca.result, ndisplay = 10, ncol = 3)
FIGURE 4: Importance of the original features to each kernel. The height of each column indicates that features contribution to the first component of the corresponding kernel.
Proteobacteria
is the most important variable for the
pro.phylo
kernel.
The relative abundance of Proteobacteria
is then
extracted in each of our 139 samples, and each sample is coloured
according to the value of this variable in the KPCA projection plot:
# subset features to those which belong to this Phylum
selected = which(TARAoceans$taxonomy[ ,"Phylum"] == "Proteobacteria")
# for each feature determine the proportion of samples
# which belong to the Proteobacteria phylum
proteobacteria.per.sample = apply(TARAoceans$pro.phylo[ ,selected], 1, sum) /
apply(TARAoceans$pro.phylo, 1, sum)
# set colour
colfunc = colorRampPalette(c("royalblue", "red"))
col.proteo = colfunc(length(proteobacteria.per.sample))
col.proteo = col.proteo[rank(proteobacteria.per.sample, ties = "first")]
plotIndiv(kernel.pca.result,
comp = c(1, 2),
ind.names = FALSE,
legend = FALSE,
group = c(1:139),
col = col.proteo,
pch = depth.pch,
pch.levels = TARAoceans$sample$depth,
legend.title = "Ocean / Sea",
title = "Representation of Proteobacteria abundance",
legend.title.pch = "Depth")
FIGURE 5: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
Similarly, the temperature is the most important variable for the
phychem
kernel. The temperature values can be displayed on
the kernel PCA projection as follows:
col.temp = colfunc(length(TARAoceans$phychem[ ,4]))
col.temp = col.temp[rank(TARAoceans$phychem[ ,4], ties = "first")]
plotIndiv(kernel.pca.result,
comp = c(1, 2),
ind.names = FALSE,
legend = FALSE,
group = c(1:139),
col = col.temp,
pch = depth.pch,
pch.levels = TARAoceans$sample$depth,
legend.title = "Ocean / Sea",
title = "Representation of mean temperature",
legend.title.pch = "Depth")
FIGURE 6: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.