# MixMC Pre-processing

Due to the sparse and compositional nature of microbiome data, there are specific pre-processing steps which need to be undergone in order to avoid spurious results. The mixOmics package contains functions to allow for this data wrangling. The Koren bodysite dataset is used for this example.

Note: there is a section below which explains how to deal with phyloseq objects specifically.

## General Process

There are three primary steps involved in this process:

1. Add an offset of 1 to the whole data matrix to deal with zeroes after centered log ratio transformation
2. Pre-filter the raw count data to remove features with low counts across all samples
3. Centered log-ratio (CLR) transformation

Note: Steps 1 and 2 can be interchanged as the pre-filtering is based on a percentage of total counts.

## R script

The R script used for all the analysis in this case study is available here.

## To begin

Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.

library(mixOmics) # import the mixOmics library

set.seed(5249) # for reproducibility, remove for normal use


### The data

The Koren bodysite dataset derives from an examination of the link between oral, gut and plaque microbial communities in patients with atherosclerosis (vs. controls).

The mixOmics Koren dataset is accessed via Koren.16S and contains the following:

• Koren.16S$data.TSS (continuous matrix): 43 rows and 980 columns. The pre-filtered normalised data using Total Sum Scaling normalisation. • Koren.16S$data.raw (continuous matrix): 43 rows and 980 columns. The prefiltered raw
count OTU data which include a 1 offset (i.e. no 0 values).
• Koren.16S$taxonomy (categorical matrix): 980 rows and 7 columns. Contains the taxonomy (ie. Phylum, … Genus, Species) of each OTU. • Koren.16S$indiv (categorical matrix): 43 rows and 22 columns. Contains all the sample meta data recorded.
• Koren.16S$bodysite (categorical vector): factor of length 43 indicating the bodysite with levels arterial plaque, saliva and stool. data("Koren.16S") # extract the microbial data data.offset <- Koren.16S$data.raw


## Step 1: Applying the offset

As can be seen above, the offset has actually already been applied to this dataset, such that data.raw = raw counts + 1. However, this is unlikely to be the case for all data. The offset is very necessary as the last step involves applying a CLR transformation – log transformations cannot be done on zeroes. Note that after the transformation there will be zeroes present in the data, but this cannot be the case prior to the CLR transformation.

A simple way to achieve this offset is as follows:

data.offset <- data.offset + 1 # apply offset


A concise way to determine that there aren't any zeroes in the dataset is the following line. If it returns 0 then the offset has been correctly applied.

sum(which(data.offset == 0)) # how many zeroes are there after offset

## [1] 0


## Step 2: Pre-filtering

This step involves removing OTUs (features) for which the sum of counts are below a certain threshold compared to the total sum of all counts. The function is given below and was adapted from Arumugam et al., (2011) [3].

low.count.removal <- function(
data, # OTU count df of size n (sample) x p (OTU)
percent=0.01 # cutoff chosen
)
{
keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
data.filter = data[,keep.otu]
return(list(data.filter = data.filter, keep.otu = keep.otu))
}


Prior to use of this function, ensure that the rows of the dataframe represent samples and the columns represent OTUs.

dim(data.offset) # check samples are in rows

## [1]  43 980


Once this is confirmed, the low.count.removal() function should be applied on the entire dataframe. Note that the data being used was also already pre-filtered, so there will be no change in OTU count in this example.

# call the function then apply on the offset data
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter # check the number of variables kept after filtering # in this particular case we had already filtered the data so no was change made, # but from now on we will work with 'data.filter' length(result.filter$keep.otu)

## [1] 980


In other examples (see [1]) there were 43,146 OTUs prior to pre-filtering. After pre-filtering, there were only 1,674 OTUs. While this pre-filtering may appear drastic (and is highly dependent on the bioinformatics steps performed beforehand, such as OTU picking), it will avoid spurious results in the downstream statistical analysis. The threshold (in this case percent = 0.01) is case specific and should be experimented with.

An extra step that is recommended is to check how heterogeneous the library sizes are per sample. Here the sum of all counts per sample is calculated and represented in a barplot (Figure 1).

lib.size <- apply(data.filter, 1, sum) # determine total count for each sample
barplot(lib.size) # and plot as bar plot


FIGURE 1: Barplot of the library size of each sample from the Koren OTU data.

Depending on the context, samples with extremely large library sizes may be desired to be removed as each sample should be relatively similar in size. This can be achieved via the following:

maximum.lib.size <- 15000

data.filter <- data.filter[-which(lib.size > maximum.lib.size),]


Note: that in this example this code to remove outliers is not run.

## Step 3: Centered Log Ratio (CLR) Transformation

For technical, biological and computational reasons, microbiome data is compositional such that they represent proportions or relative information. Total Sum Scaling (TSS) is often used to calculate proportions from compositional data. Proportional data are restricted to a space where the sum of all OTU proportions for a given sample sums to 1. Using standard statistical methods on such data may lead to spurious results. Likewise, any data that are compositional in nature are interpreted into relative counts. Hence, using a CLR transformation allows the circumvention of these spurious results (explained in more dpeth in [4]).

There are two ways of log-ratio transforming the data in mixOmics:

• Option 1: Some of our functions (pca, plsda) directly include the argument logratio = 'CLR', so all you need to do is include your filtered offset data and add this argument (see example below).
• Option 2: Some functions currently do not include the logratio argument. In this case, you will need to use the logratio.transfo() function as shown below. You can also use this function if you only have access to TSS (proportions) data and those were not offset.

## Undergoing a PCA

### Option 1

# undergo PCA on CLR transformed data
pca.result <- pca(data.filter, logratio = 'CLR')

plotIndiv(pca.result,  # plot samples
group = Koren.16S$bodysite, title = 'Koren, PCA Comps 1&2', legend = TRUE)  FIGURE 2: Samples of Koren OTU data projected onto components after a CLR transformation (done within the pca() function). ### Option 2 The following line will correctly apply a CLR transformation in the scenario where methods without the logratio parameter are being used: # undergo CLR transformation data.clr <- logratio.transfo(as.matrix(data.filter), logratio = 'CLR', offset = 0)  The argument offset is set to 0 here as zeroes in the dataframe were already dealt with (in step 1). In the case where one only has access to TSS data and not raw OTU counts, there is likely to still be zeroes in the data (as no offset is appled). In this situation, the offset parameter of the logratio.transfo() function can be used to prevent any issues with a log transform of a zero value. pca.clr <- pca(data.clr) # undergo PCA on CLR transformed data plotIndiv(pca.clr, # plot samples group = Koren.16S$bodysite,
title = 'Koren, PCA Comps 1&2',
legend = TRUE)


FIGURE 3: Samples of Koren OTU data projected onto components after a CLR transformation (done prior to calling the pca() function).

The two arterial plaque samples which stand out (Plaque 235 & Plaque244) are those which would have been removed at the end of step 2. Despite the scaling, these two points are still outliers.

## Pre-processing with a phyloseq object

Some microbiome data may derive from the phyloseq package. While this is ultimately compatible with the mixOmics methods, there is a bit of extra pre-processing that needs to be done first. A big thank you to Ms Laetitia Cardona from INRAE for writting the following code.

First, the OTU, taxonomy and meta data are extracted.

Note: Ensure that you are using the latest version of R when using the phyloseq package.

library(phyloseq)
data(GlobalPatterns) # load the data from the phyloseq package

taxo <- tax_table(GlobalPatterns) # extraction of the taxonomy

meta.data <- GlobalPatterns@sam_data # extraction of the metadata

# extract OTU table from phyloseq object
# samples should be in row and variables in column
data.raw <- t(otu_table(GlobalPatterns))


Then, the offset is applied.

# STEP 1: OFFSET

data.offset <- data.raw+1
sum(which(data.offset == 0)) # check how many zeroes there are

## [1] 0


The dimensions of this large dataframe is checked:

dim(data.offset) # check dimensions

## [1]    26 19216


This is followed by removal of all OTUs with a low total count. This leave 988 OTUs.

# STEP 2: PRE-FILTER

# remove low count OTUs
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter length(result.filter$keep.otu) # check how many OTUs remain

## [1] 988


Lastly, we need to remove any outliers.

lib.size <- apply(data.filter, 1, sum) # determine size of each library

# remove samples which exceed max library size (15000)
maximum.lib.size <- 15000
data.filter <- data.filter[-which(lib.size > maximum.lib.size),]


Now a PCA can be undergone. Ensure that the logratio parameter is passed in to complete the third step of pre-processing.

# undergo PCA after CLR transformation
pca.result <- pca(data.filter, logratio = 'CLR')

# plot samples
plotIndiv(pca.result, group = meta.data\$SampleType,
title = 'Global Patterns, PCA Comps 1&2')


FIGURE 4: Samples of Global Pattern OTU data projected onto components after a CLR transformation.

## FAQ

#### Can I apply another mixOmics method (multivariate or univariate) on the CLR data?

Any method which does not use the logratio parameter is appropriate to use on microbiome data as long as the logratio.transfo() is applied prior. Remember to apply the offset (step 1) prior to this to avoid having to use the small offset within this function. The CLR is abolsutely crucial when dealing with compositional data! That, or the statistical method needs to account for this compositionality, see [5].

#### Can I apply a CSS normalisation instead?

Cumulative Sum Scaling (CSS) normalisation was developed for sparse sequencing count data [6]. CSS is an extension of the quantile normalisation procedure. It consists of a cumulative sum that increases up until a percentile determined using a data-driven approach. CSS corrects the bias in the assessment of differential abundance introduced by TSS and, according to the authors, would partially account for compositional data. Therefore, for CSS normalised data, no log-ratio transformation is applied as it is considered that this normalisation method does not produce compositional data per se. A simple log transformation is then applied.

The code to undergo CSS normalisation is found directly below:

library(metagenomeSeq)

data.metagenomeSeq = newMRexperiment(t(data.filter), # using filtered data
featureData=NULL, libSize=NULL, normFactors=NULL)
p = cumNormStat(data.metagenomeSeq) # default is 0.5
data.cumnorm = cumNorm(data.metagenomeSeq, p=p)

data.CSS = t(MRcounts(data.cumnorm, norm=TRUE, log=TRUE))

# ensure the data are in a correct format: number of samples in rows
dim(data.CSS)