Case Study of MixKernel with Tara Ocean data
Using kernel transformations allows for linearly inseparable data to be converted to a space where it can be linearly separable. The
mixKernel package provides methods that combine kernels to improve the unsupervised exploration of a set of data. This vignette aims to undergo an unsupervised analysis on the Tara Ocean data utilising the Kernel PCA method.
The R script used for all the analysis in this case study is available here.
Load the latest version of
library(mixOmics) # import the mixOmics library library(mixKernel) # import the mixKernel library
The TARA Oceans expedition facilitated the study of plankton communities by providing oceans metagenomic data combined with environmental measures to the scientific community. This study focuses on 139 prokaryotic-enriched samples collected from 68 stations and spread across three depth layers. Samples were located in 8 different oceans or seas.
This case study only considers a subset of this data (published by Mariette & Villa-Vialaneix (INRA Toulouse, France) ). The data include 1% of the 35,650 prokaryotic OTUs and of the 39,246 bacterial genes that were randomly selected.
mixKernel Tara Ocean dataset is accessed via
TARAoceans and contains the following:
TARAoceans$phychem(continuous matrix): 139 rows and 22 columns. Each row represents a sample and each column an environmental variable.
TARAoceans$pro.phylo(continuous matrix): 139 rows and 356 columns. Contains information on the prokaryotic OTUs (Operational Taxonomic Unit).
TARAoceans$taxonomy(categorical matrix): 356 rows and 6 columns. Indicates the taxonomy of each OTU represented in the
phyloobject): represents the prokaryotic OTU phylogenetic tree (see the
TARAoceans$pro.NOGs(categorical matrix): 139 rows and 638 columns. Contains NOG data.
TARAoceans$sample(categorical vector): a list containing three following entries (all three are character vectors each of legnth 139):
ocean(oceanic region of the sample) and
Individual Kernel Computation
For each inputted dataset, a kernel is computed using the
compute.kernel() function. It can be constructed in a linear (
"linear"), phylogenic (
"phylogenetic") or abundance (
"abundance") manner. The
kernel.func parameter can also be used to pass in a specific, user defined kernel. After applying this function, the resulting object has a
kernel component which contains the kernel matrix – a symmetric matrix with a size equal to the number of samples in the input dataset.
phychem.kernel = compute.kernel(TARAoceans$phychem, kernel.func = "linear") pro.phylo.kernel = compute.kernel(TARAoceans$pro.phylo, kernel.func = "abundance") pro.NOGs.kernel = compute.kernel(TARAoceans$pro.NOGs, kernel.func = "abundance")
cim.kernel() function allows for a general overview of the correlation structure between datasets. From Figure 1, the
pro.NOGs datasets are the most highly (and positively) correlated. This result is expected as both kernels provide a summary of prokaryotic communities.
cim.kernel(phychem = phychem.kernel, pro.phylo = pro.phylo.kernel, pro.NOGs = pro.NOGs.kernel, method = "square")
FIGURE 1: Simple Clustered Image Map depicting the correlation structure between the three inputted datasets. The colour uses the legend (on the right) to denote the correlation strength.
Combined Kernel Computation
combine.kernels() implements 3 different methods for combining kernels:
full-UMKL (see more details in Mariette and Villa-Vialaneix, 2017). It returns a meta-kernel that can be used as an input for the function
kernel.pca(). The three methods bring complementary information and must be chosen according to the research question.
STATIS-UMKL approach gives an overview on the common information between the different datasets. The
full-UMKL computes a kernel that minimizes the distortion between all input kernels.
sparse-UMKL is a sparse variant of full-UMKL but also selects the most relevant kernels.
meta.kernel = combine.kernels(phychem = phychem.kernel, pro.phylo = pro.phylo.kernel, pro.NOGs = pro.NOGs.kernel, method = "full-UMKL")
Kernel Principal Component Analysis (KPCA)
KPCA utilises the generated kernels to improve the performance of the standard PCA algorithm. Data that otherwise could not have its dimensions reduced by PCA can under dimension reduction via KPCA. A KPCA is applied here on the combined kernel generated above vua the
kernel.pca() function. The
ncomp parameter controls the number of novel components to extract.
kernel.pca.result = kernel.pca(meta.kernel, ncomp = 10)
The projection of the samples onto the new components (within the kernel) is shown below in Figure 2.
all.depths = levels(factor(TARAoceans$sample$depth)) depth.pch = c(20, 17, 4, 3)[match(TARAoceans$sample$depth, all.depths)] plotIndiv(kernel.pca.result, comp = c(1, 2), ind.names = FALSE, legend = TRUE, group = as.vector(TARAoceans$sample$ocean), col.per.group = c("#f99943", "#44a7c4", "#05b052", "#2f6395", "#bb5352", "#87c242", "#07080a", "#92bbdb"), pch = depth.pch, pch.levels = TARAoceans$sample$depth, legend.title = "Ocean / Sea", title = "Projection of TARA Oceans stations", size.title = 10, legend.title.pch = "Depth")
FIGURE 2: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
It can be seen in Figure 3 that the first component of this kernel explains nearly 20% of the data's variance.
FIGURE 3: Explained variance of each principal component of a KPCA run over the meta kernel of the three inputted dataframes.
Assessing important variables
Here we focus on the information summarised on the first component. Variables values are randomly permuted with the function
In the following example, physical variables are permuted at the variable level (kernel
phychem), OTU abundances from
pro.phylo kernel are permuted at the phylum level (OTU phyla are stored in the second column, named
Phylum, of the taxonomy annotation provided in
TARAoceans object in the entry
taxonomy) and gene abundances from
pro.NOGs are permuted at the GO level (GO are provided in the entry
TARAoceans$GO of the dataset).
kernel.pca.permute() function determines the importance of a given variable by computing the Crone-Crosby distance  between the original vector of values and a vector of these values that have been permuted randomly.
# here we set a seed for reproducible results with this tutorial set.seed(17051753) kernel.pca.result = kernel.pca.permute(kernel.pca.result, ncomp = 1, phychem = colnames(TARAoceans$phychem), pro.phylo = TARAoceans$taxonomy[ ,"Phylum"], pro.NOGs = TARAoceans$GO)
Results are displayed with the function
plotVar.kernel.pca(). The argument
ndisplay indicates the number of variables to display for each kernel. For each of the three dataframes, the features which contribute to first component (
ncomp = 1 in the
kernel.pca.permute() function) to the greatest degree are shown (Figure 4).
plotVar.kernel.pca(kernel.pca.result, ndisplay = 10, ncol = 3)
FIGURE 4: Importance of the original features to each kernel. The height of each column indicates that features contribution to the first component of the corresponding kernel.
Proteobacteria is the most important variable for the
The relative abundance of
Proteobacteria is then extracted in each of our 139 samples, and each sample is coloured according to the value of this variable in the KPCA projection plot:
# subset features to those which belong to this Phylum selected = which(TARAoceans$taxonomy[ ,"Phylum"] == "Proteobacteria") # for each feature determine the proportion of samples # which belong to the Proteobacteria phylum proteobacteria.per.sample = apply(TARAoceans$pro.phylo[ ,selected], 1, sum) / apply(TARAoceans$pro.phylo, 1, sum) # set colour colfunc = colorRampPalette(c("royalblue", "red")) col.proteo = colfunc(length(proteobacteria.per.sample)) col.proteo = col.proteo[rank(proteobacteria.per.sample, ties = "first")] plotIndiv(kernel.pca.result, comp = c(1, 2), ind.names = FALSE, legend = FALSE, group = c(1:139), col = col.proteo, pch = depth.pch, pch.levels = TARAoceans$sample$depth, legend.title = "Ocean / Sea", title = "Representation of Proteobacteria abundance", legend.title.pch = "Depth")
FIGURE 5: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
Similarly, the temperature is the most important variable for the
phychem kernel. The temperature values can be displayed on the kernel PCA projection as follows:
col.temp = colfunc(length(TARAoceans$phychem[ ,4])) col.temp = col.temp[rank(TARAoceans$phychem[ ,4], ties = "first")] plotIndiv(kernel.pca.result, comp = c(1, 2), ind.names = FALSE, legend = FALSE, group = c(1:139), col = col.temp, pch = depth.pch, pch.levels = TARAoceans$sample$depth, legend.title = "Ocean / Sea", title = "Representation of mean temperature", legend.title.pch = "Depth")
FIGURE 6: Projection of the samples onto the first two Principal Components of a KPCA run over the meta kernel of the three inputted dataframes.
Mariette, J. and Villa-Vialaneix, N. (2017) Integrating TARA Oceans datasets using unsupervised multiple kernel learning. Bioinformatics 34(6)
Crone L. and Crosby D. (1995). Statistical applications of a metric on subspaces to satellite meteorology. Technometrics, 37(3), 324-328.