Due to the sparse and compositional nature of microbiome data, there
are specific pre-processing steps which need to be undergone in order to
avoid spurious results. The mixOmics
package contains
functions to allow for this data wrangling. The Koren bodysite dataset
is used for this example.
Note: there is a section below which explains how to deal with
phyloseq
objects specifically.
There are three primary steps involved in this process:
Note: Steps 1 and 2 can be interchanged as the pre-filtering is based on a percentage of total counts.
Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.
library(mixOmics) # import the mixOmics library
set.seed(5249) # for reproducibility, remove for normal use
The Koren bodysite dataset derives from an examination of the link between oral, gut and plaque microbial communities in patients with atherosclerosis (vs. controls).
The mixOmics
Koren dataset is accessed via
Koren.16S
and contains the following:
Koren.16S$data.TSS
(continuous matrix): 43 rows and 980
columns. The pre-filtered normalised data using Total Sum Scaling
normalisation.Koren.16S$data.raw
(continuous matrix): 43 rows and 980
columns. The prefiltered raw count OTU data which include a 1 offset
(i.e. no 0 values).Koren.16S$taxonomy
(categorical matrix): 980 rows and 7
columns. Contains the taxonomy (ie. Phylum, … Genus, Species) of each
OTU.Koren.16S$indiv
(categorical matrix): 43 rows and 22
columns. Contains all the sample meta data recorded.Koren.16S$bodysite
(categorical vector): factor of
length 43 indicating the bodysite with levels
arterial plaque
, saliva
and
stool
.data("Koren.16S") # extract the microbial data
data.offset <- Koren.16S$data.raw
As can be seen above, the offset has actually already been applied to
this dataset, such that data.raw = raw counts + 1
. However,
this is unlikely to be the case for all data. The offset is very
necessary as the last step involves applying a CLR transformation - log
transformations cannot be done on zeroes. Note that after the
transformation there will be zeroes present in the data, but this cannot
be the case prior to the CLR transformation.
A simple way to achieve this offset is as follows:
data.offset <- data.offset + 1 # apply offset
A concise way to determine that there aren’t any zeroes in the
dataset is the following line. If it returns 0
then the
offset has been correctly applied.
sum(which(data.offset == 0)) # how many zeroes are there after offset
## [1] 0
This step involves removing OTUs (features) for which the sum of counts are below a certain threshold compared to the total sum of all counts. The function is given below and was adapted from Arumugam et al., (2011) [3].
low.count.removal <- function(
data, # OTU count df of size n (sample) x p (OTU)
percent=0.01 # cutoff chosen
)
{
keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
data.filter = data[,keep.otu]
return(list(data.filter = data.filter, keep.otu = keep.otu))
}
Prior to use of this function, ensure that the rows of the dataframe represent samples and the columns represent OTUs.
dim(data.offset) # check samples are in rows
## [1] 43 980
Once this is confirmed, the low.count.removal()
function
should be applied on the entire dataframe. Note that the data being used
was also already pre-filtered, so there will be no change in OTU count
in this example.
# call the function then apply on the offset data
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter
# check the number of variables kept after filtering
# in this particular case we had already filtered the data so no was change made,
# but from now on we will work with 'data.filter'
length(result.filter$keep.otu)
## [1] 980
In other examples (see [1]) there were 43,146 OTUs prior to
pre-filtering. After pre-filtering, there were only 1,674 OTUs. While
this pre-filtering may appear drastic (and is highly dependent on the
bioinformatics steps performed beforehand, such as OTU picking), it will
avoid spurious results in the downstream statistical analysis. The
threshold (in this case percent = 0.01
) is case specific
and should be experimented with.
An extra step that is recommended is to check how heterogeneous the library sizes are per sample. Here the sum of all counts per sample is calculated and represented in a barplot (Figure 1).
lib.size <- apply(data.filter, 1, sum) # determine total count for each sample
barplot(lib.size) # and plot as bar plot
FIGURE 1: Barplot of the library size of each sample from the Koren OTU data.
Depending on the context, samples with extremely large library sizes may be desired to be removed as each sample should be relatively similar in size. This can be achieved via the following:
maximum.lib.size <- 15000
data.filter <- data.filter[-which(lib.size > maximum.lib.size),]
Note: that in this example this code to remove outliers is not run.
For technical, biological and computational reasons, microbiome data is compositional such that they represent proportions or relative information. Total Sum Scaling (TSS) is often used to calculate proportions from compositional data. Proportional data are restricted to a space where the sum of all OTU proportions for a given sample sums to 1. Using standard statistical methods on such data may lead to spurious results. Likewise, any data that are compositional in nature are interpreted into relative counts. Hence, using a CLR transformation allows the circumvention of these spurious results (explained in more dpeth in [4]).
There are two ways of log-ratio transforming the data in mixOmics:
pca
,
plsda
) directly include the argument
logratio = 'CLR'
, so all you need to do is include your
filtered offset data and add this argument (see example below).logratio
argument. In this case, you will need to use the
logratio.transfo()
function as shown below. You can also
use this function if you only have access to TSS (proportions) data and
those were not offset.# undergo PCA on CLR transformed data
pca.result <- pca(data.filter, logratio = 'CLR')
plotIndiv(pca.result, # plot samples
group = Koren.16S$bodysite,
title = 'Koren, PCA Comps 1&2',
legend = TRUE)
FIGURE 2: Samples of Koren OTU data projected onto components after a
CLR transformation (done within the pca()
function).
The following line will correctly apply a CLR transformation in the
scenario where methods without the logratio
parameter are
being used:
# undergo CLR transformation
data.clr <- logratio.transfo(as.matrix(data.filter),
logratio = 'CLR', offset = 0)
The argument offset
is set to 0 here as zeroes in the
dataframe were already dealt with (in step 1). In the case where one
only has access to TSS data and not raw OTU counts, there is likely to
still be zeroes in the data (as no offset is appled). In this situation,
the offset
parameter of the logratio.transfo()
function can be used to prevent any issues with a log transform of a
zero value.
pca.clr <- pca(data.clr) # undergo PCA on CLR transformed data
plotIndiv(pca.clr, # plot samples
group = Koren.16S$bodysite,
title = 'Koren, PCA Comps 1&2',
legend = TRUE)
FIGURE 3: Samples of Koren OTU data projected onto components after a
CLR transformation (done prior to calling the pca()
function).
The two arterial plaque
samples which stand out
(Plaque 235
& Plaque244
) are those which
would have been removed at the end of step 2. Despite the scaling, these
two points are still outliers.
Some microbiome data may derive from the phyloseq
package. While this is ultimately compatible with the
mixOmics
methods, there is a bit of extra pre-processing
that needs to be done first. A big thank you to Ms Laetitia Cardona from
INRAE for writting the following code.
First, the OTU, taxonomy and meta data are extracted.
Note: Ensure that you are using the latest version of R when
using the phyloseq
package.
library(phyloseq)
data(GlobalPatterns) # load the data from the phyloseq package
taxo <- tax_table(GlobalPatterns) # extraction of the taxonomy
meta.data <- GlobalPatterns@sam_data # extraction of the metadata
# extract OTU table from phyloseq object
# samples should be in row and variables in column
data.raw <- t(otu_table(GlobalPatterns))
Then, the offset is applied.
# STEP 1: OFFSET
data.offset <- data.raw+1
sum(which(data.offset == 0)) # check how many zeroes there are
## [1] 0
The dimensions of this large dataframe is checked:
dim(data.offset) # check dimensions
## [1] 26 19216
This is followed by removal of all OTUs with a low total count. This leave 988 OTUs.
# STEP 2: PRE-FILTER
# remove low count OTUs
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter
length(result.filter$keep.otu) # check how many OTUs remain
## [1] 988
Lastly, we need to remove any outliers.
lib.size <- apply(data.filter, 1, sum) # determine size of each library
# remove samples which exceed max library size (15000)
maximum.lib.size <- 15000
data.filter <- data.filter[-which(lib.size > maximum.lib.size),]
Now a PCA can be undergone. Ensure that the logratio
parameter is passed in to complete the third step of pre-processing.
# undergo PCA after CLR transformation
pca.result <- pca(data.filter, logratio = 'CLR')
# plot samples
plotIndiv(pca.result, group = meta.data$SampleType,
title = 'Global Patterns, PCA Comps 1&2')
FIGURE 4: Samples of Global Pattern OTU data projected onto components after a CLR transformation.
Any method which does not use the logratio
parameter is
appropriate to use on microbiome data as long as the
logratio.transfo()
is applied prior. Remember to apply the
offset (step 1) prior to this to avoid having to use the small offset
within this function. The CLR is abolsutely crucial when dealing
with compositional data! That, or the statistical method needs
to account for this compositionality, see [5].
Cumulative Sum Scaling (CSS) normalisation was developed for sparse sequencing count data [6]. CSS is an extension of the quantile normalisation procedure. It consists of a cumulative sum that increases up until a percentile determined using a data-driven approach. CSS corrects the bias in the assessment of differential abundance introduced by TSS and, according to the authors, would partially account for compositional data. Therefore, for CSS normalised data, no log-ratio transformation is applied as it is considered that this normalisation method does not produce compositional data per se. A simple log transformation is then applied.
The code to undergo CSS normalisation is found directly below:
library(metagenomeSeq)
data.metagenomeSeq = newMRexperiment(t(data.filter), # using filtered data
featureData=NULL, libSize=NULL, normFactors=NULL)
p = cumNormStat(data.metagenomeSeq) # default is 0.5
data.cumnorm = cumNorm(data.metagenomeSeq, p=p)
data.CSS = t(MRcounts(data.cumnorm, norm=TRUE, log=TRUE))
# ensure the data are in a correct format: number of samples in rows
dim(data.CSS)