Case Study of MixKernel with Tara Ocean data
Using kernel transformations allows for linearly inseparable data to be converted to a space where it can be linearly separable. The mixKernel
package provides methods that combine kernels to improve the unsupervised exploration of a set of data. This vignette aims to undergo an unsupervised analysis on the Tara Ocean data utilising the Kernel PCA method.
R script
The R script used for all the analysis in this case study is available here.
To begin
Load the latest version of mixOmics
and mixKernel
.
library(mixOmics) # import the mixOmics library
library(mixKernel) # import the mixKernel library
The data
The TARA Oceans expedition facilitated the study of plankton communities by providing oceans metagenomic data combined with environmental measures to the scientific community. This study focuses on 139 prokaryotic-enriched samples collected from 68 stations and spread across three depth layers. Samples were located in 8 different oceans or seas.
This case study only considers a subset of this data (published by Mariette & Villa-Vialaneix (INRA Toulouse, France) [1]). The data include 1% of the 35,650 prokaryotic OTUs and of the 39,246 bacterial genes that were randomly selected.
The mixKernel
Tara Ocean dataset is accessed via TARAoceans
and contains the following:
TARAoceans$phychem
(continuous matrix): 139 rows and 22 columns. Each row represents a sample and each column an environmental variable.TARAoceans$pro.phylo
(continuous matrix): 139 rows and 356 columns. Contains information on the prokaryotic OTUs (Operational Taxonomic Unit).TARAoceans$taxonomy
(categorical matrix): 356 rows and 6 columns. Indicates the taxonomy of each OTU represented in thepro.phylo
dataframe.TARAoceans$phylogenetic.tree
(phylo
object): represents the prokaryotic OTU phylogenetic tree (see theape
package).TARAoceans$pro.NOGs
(categorical matrix): 139 rows and 638 columns. Contains NOG data.TARAoceans$sample
(categorical vector): a list containing three following entries (all three are character vectors each of legnth 139):name
(sample name),ocean
(oceanic region of the sample) anddepth
(sample depth)
Individual Kernel Computation
For each inputted dataset, a kernel is computed using the compute.kernel()
function. It can be constructed in a linear ("linear"
), phylogenic ("phylogenetic"
) or abundance ("abundance"
) manner. The kernel.func
parameter can also be used to pass in a specific, user defined kernel. After applying this function, the resulting object has a kernel
component which contains the kernel matrix – a symmetric matrix with a size equal to the number of samples in the input dataset.
phychem.kernel = compute.kernel(TARAoceans$phychem,
kernel.func = "linear")
pro.phylo.kernel = compute.kernel(TARAoceans$pro.phylo,
kernel.func = "abundance")
pro.NOGs.kernel = compute.kernel(TARAoceans$pro.NOGs,
kernel.func = "abundance")
The cim.kernel()
function allows for a general overview of the correlation structure between datasets. From Figure 1, the pro.phylo
and pro.NOGs
datasets are the most highly (and positively) correlated. This result is expected as both kernels provide a summary of prokaryotic communities.
cim.kernel(phychem = phychem.kernel,
pro.phylo = pro.phylo.kernel,
pro.NOGs = pro.NOGs.kernel,
method = "square")
FIGURE 1: Simple Clustered Image Map depicting the correlation structure between the three inputted datasets. The colour uses the legend (on the right) to denote the correlation strength.
Combined Kernel Computation
The function combine.kernels()
implements 3 different methods for combining kernels: STATIS-UMKL
, sparse-UMKL
and full-UMKL
(see more details in Mariette and Villa-Vialaneix, 2017). It returns a meta-kernel that can be used as an input for the function kernel.pca()
. The three methods bring complementary information and must be chosen according to the research question.
The STATIS-UMKL
approach gives an overview on the common information between the different datasets. The full-UMKL
computes a kernel that minimizes the distortion between all input kernels. sparse-UMKL
is a sparse variant of full-UMKL but also selects the most relevant kernels.
meta.kernel = combine.kernels(phychem = phychem.kernel,
pro.phylo = pro.phylo.kernel,
pro.NOGs = pro.NOGs.kernel,
method = "full-UMKL")
Kernel Principal Component Analysis (KPCA)
KPCA utilises the generated kernels to improve the performance of the standard PCA algorithm. Data that otherwise could not have its dimensions reduced by PCA can under dimension reduction via KPCA. A KPCA is applied here on the combined kernel generated above vua the kernel.pca()
function. The ncomp
parameter controls the number of novel components to extract.
kernel.pca.result = kernel.pca(meta.kernel, ncomp = 10)
The projection of the samples onto the new components (within the kernel) is shown below in Figure 2.
all.depths = levels(factor(TARAoceans$sample$depth))
depth.pch = c(20, 17, 4, 3)[match(TARAoceans$sample$depth, all.depths)]
plotIndiv(kernel.pca.result,
comp = c(1, 2),
ind.names = FALSE,
legend = TRUE,
group = as.vector(TARAoceans$sample$ocean),
col.per.group = c("#f99943", "#44a7c4", "#05b052", "#2f6395",
"#bb5352", "#87c242", "#07080a", "#92bbdb"),
pch = depth.pch,
pch.levels = TARAoceans$sample$depth,
legend.title = "Ocean / Sea",
title = "Projection of TARA Oceans stations",
size.title = 10,
legend.title.pch = "Depth")
FIGURE 2: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
It can be seen in Figure 3 that the first component of this kernel explains nearly 20% of the data’s variance.
plot(kernel.pca.result)
FIGURE 3: Explained variance of each principal component of a KPCA run over the meta kernel of the three inputted dataframes.
Assessing important variables
Here we focus on the information summarised on the first component. Variables values are randomly permuted with the function permute.kernel.pca()
.
In the following example, physical variables are permuted at the variable level (kernel phychem
), OTU abundances from pro.phylo
kernel are permuted at the phylum level (OTU phyla are stored in the second column, named Phylum
, of the taxonomy annotation provided in TARAoceans
object in the entry taxonomy
) and gene abundances from pro.NOGs
are permuted at the GO level (GO are provided in the entry TARAoceans$GO
of the dataset).
The kernel.pca.permute()
function determines the importance of a given variable by computing the Crone-Crosby distance [2] between the original vector of values and a vector of these values that have been permuted randomly.
# here we set a seed for reproducible results with this tutorial
set.seed(17051753)
kernel.pca.result = kernel.pca.permute(kernel.pca.result, ncomp = 1,
phychem = colnames(TARAoceans$phychem),
pro.phylo = TARAoceans$taxonomy[ ,"Phylum"],
pro.NOGs = TARAoceans$GO)
Results are displayed with the function plotVar.kernel.pca()
. The argument ndisplay
indicates the number of variables to display for each kernel. For each of the three dataframes, the features which contribute to first component (ncomp = 1
in the kernel.pca.permute()
function) to the greatest degree are shown (Figure 4).
plotVar.kernel.pca(kernel.pca.result, ndisplay = 10, ncol = 3)
FIGURE 4: Importance of the original features to each kernel. The height of each column indicates that features contribution to the first component of the corresponding kernel.
Proteobacteria
is the most important variable for the pro.phylo
kernel.
The relative abundance of Proteobacteria
is then extracted in each of our 139 samples, and each sample is coloured according to the value of this variable in the KPCA projection plot:
# subset features to those which belong to this Phylum
selected = which(TARAoceans$taxonomy[ ,"Phylum"] == "Proteobacteria")
# for each feature determine the proportion of samples
# which belong to the Proteobacteria phylum
proteobacteria.per.sample = apply(TARAoceans$pro.phylo[ ,selected], 1, sum) /
apply(TARAoceans$pro.phylo, 1, sum)
# set colour
colfunc = colorRampPalette(c("royalblue", "red"))
col.proteo = colfunc(length(proteobacteria.per.sample))
col.proteo = col.proteo[rank(proteobacteria.per.sample, ties = "first")]
plotIndiv(kernel.pca.result,
comp = c(1, 2),
ind.names = FALSE,
legend = FALSE,
group = c(1:139),
col = col.proteo,
pch = depth.pch,
pch.levels = TARAoceans$sample$depth,
legend.title = "Ocean / Sea",
title = "Representation of Proteobacteria abundance",
legend.title.pch = "Depth")
FIGURE 5: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
Similarly, the temperature is the most important variable for the phychem
kernel. The temperature values can be displayed on the kernel PCA projection as follows:
col.temp = colfunc(length(TARAoceans$phychem[ ,4]))
col.temp = col.temp[rank(TARAoceans$phychem[ ,4], ties = "first")]
plotIndiv(kernel.pca.result,
comp = c(1, 2),
ind.names = FALSE,
legend = FALSE,
group = c(1:139),
col = col.temp,
pch = depth.pch,
pch.levels = TARAoceans$sample$depth,
legend.title = "Ocean / Sea",
title = "Representation of mean temperature",
legend.title.pch = "Depth")
FIGURE 6: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.