# mixMC: Pre-processing

Here are our data processing steps for microbiome data analysis with the package:

1 – Add an offset of 1 to the whole data matrix, to handle zeroes for the subsequent centered log-ratio transformation

2 – Prefilter the raw count data to remove features with low counts across all samples

3 – Centered log-ratio transformation

Note: Steps 1 and 2 can be swapped as the prefiltering is based on a percentage of total counts.

We give an example in the FAQ to extract the data from the phyloseq package.

Here we use the Koren data set as a working example for pre-filtering and normalisation, the first steps in a data analysis with our mixOmics framework for microbial community analysis, mixMC [1]. Whole genome sequencing data can also be processed similarly to what is proposed here. For 16S data, we analyse the taxa at the OTU level, but we can report the results at the family or species level after OTU selection (see examples in the other tabs).

The Koren.16S data set includes 43 samples from three bodysites (oral, gut and plaque) in patients with atherosclerosis, measured on 980 OTUs [2]. The data are directly available from the mixOmics package:

library(mixOmics)
data("Koren.16S")
# ls() # lists objects in the working directory


The Koren data set (Koren.16S) includes 16S data at various stages of processing:

• Koren.16S$data.raw: raw counts with an offset of 1 (note: the offset is applied to the whole dataset),

• Koren.16S$data.TSS: Total Sum Scaling applied to the raw + 1 counts (proportional counts; we do not need this for our multivariate analyses).

We also have various meta data information on the OTUs and the samples:

• Koren.16S$taxonomy: a data frame whose row names (OTU IDs) match the column names of the count data and indicate the taxonomy levels (7 columns) of each OTU,

• Koren.16S$indiv and Koren.16S$bodysite: meta data information on the samples.

## Offset

In this example we have already added an offset of 1 to all data (i.e. data.raw = raw counts + 1). This may not be the case for your own data. The offset is necessary as we will later log-transform the data (log transformation does not like zeroes!). Note that the offset will not circumvent the zero values issue: after the log-ratio transformation we will still have zero values. This is a pragmatic step to handle log transformation on zeroes, but feel free to use a different approach. We load the offset count data:

data.offset <- Koren.16S$data.raw
dim(data.offset)      # check dimensions

## [1]  43 980

data.offset[1:5,1:5]  # inspect the data

##          2221284 673010 410908 177792 4294607
## Feces659       1      1      1     99       1
## Feces309       1      1      1      1       1
## Mouth599       1      1      2      1       1
## Mouth386       1      1      1      1       1
## Feces32        1      1      1     25       1

# check there are no zeroes in the offset data
# before the CLR transformation
sum(data.offset == 0)  # ok!

## [1] 0


## Pre-filtering

We use a pre-filtering step to remove OTUs for which the sum of counts across all samples is below a given percentage of the total count. The function is given below, and was adapted from Arumugam et al. (2011) [3].

# function to perform pre-filtering
low.count.removal <- function(
  data,           # OTU count data frame of size n (samples) x p (OTUs)
  percent = 0.01  # cutoff chosen
){
  keep.otu <- which(colSums(data) * 100 / sum(colSums(data)) > percent)
  data.filter <- data[, keep.otu]
  return(list(data.filter = data.filter, keep.otu = keep.otu))
}


We then run this function on the data. Before applying it, make sure your samples are in rows and variables in columns!

dim(data.offset) # check samples are in rows

## [1]  43 980

# call the function then apply on the offset data
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter

# check the number of variables kept after filtering
# in this particular case the data had already been filtered,
# so no change was made, but from now on we will work with 'data.filter'
length(result.filter$keep.otu)

## [1] 980


In other examples (see [1]) we started from 43,146 OTUs and ended with 1,674 OTUs after pre-filtering. While this pre-filtering may appear drastic (and is highly dependent on the bioinformatics steps performed beforehand, such as OTU picking), it will avoid spurious results in the downstream statistical analysis. Feel free to increase that threshold for your own needs, or not use it at all if you prefer.

An extra step we recommend is to check how heterogeneous our library sizes are per sample. Here we calculate the sum of all counts per sample, and represent the values in a barplot.

lib.size <- apply(data.filter, 1, sum)
barplot(lib.size)


We want to ensure that the number of counts for each sample is 'relatively' similar and that there are no obvious outlier samples. Such samples may also appear as outliers in downstream PCA plots. However, as with any outlier removal, make sure you have a good reason to remove them!

# we can investigate further some very high lib sizes,
# just in case those samples influence the downstream analysis:
which(lib.size > 15000)

## Plaque244 Plaque235
##         7         8


## Log-ratio transformation

Microbiome data are compositional, for technical, biological and computational reasons. Researchers often calculate proportions from the data using Total Sum Scaling. Proportion data are restricted to a space where the OTU proportions for a given sample sum to 1. Using standard statistical methods on such data may lead to spurious results. More generally, any data that are compositional in nature (such as microbiome data) should be interpreted as relative, not absolute, counts.

Transforming compositional data using log ratios such as Centered Log Ratio transformation (CLR) allows us to circumvent this issue as proposed by [4]. The CLR is our transformation of choice in mixOmics.

Note: as the CLR already divides each count by the geometric mean, it does not make a difference whether we apply TSS + CLR, or CLR on the count (filtered) data directly. This is why we are skipping the TSS step here.
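This equivalence is easy to check in base R on a single sample. Below is a minimal sketch (the clr helper and the toy counts are hypothetical, for illustration only, not mixOmics functions):

```r
# minimal check that CLR is invariant to Total Sum Scaling
# (clr() here is a hand-rolled helper, not a mixOmics function)
clr <- function(v) log(v) - mean(log(v))  # centered log-ratio of one sample

x   <- c(10, 1, 99, 25, 1)  # toy offset counts for one sample
tss <- x / sum(x)           # the same sample as proportions (TSS)

all.equal(clr(x), clr(tss))  # TRUE: CLR(counts) == CLR(TSS)
```

This holds because dividing every count by the sample total shifts all log-counts by the same constant, which the centering then removes.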

There are two ways of log-ratio transforming the data in mixOmics:

• Some of our functions (pca, plsda) directly include the argument logratio = 'CLR', so all you need to do is supply your filtered offset data and add this argument (see example below).

• Some functions currently do not include the log-ratio argument. In this case, you will need to use the logratio.transfo function as shown below. You can also use this function if you only have access to TSS (proportions) data and those were not offset (see Note below).

# we input the data as a matrix, here no need to add an offset as it is already done
data.clr <- logratio.transfo(as.matrix(data.filter), logratio = 'CLR', offset = 0)


Note: The argument offset is not necessary here, as we started from raw counts and have already applied an offset. However, there are cases where you may not have access to the raw counts, but instead to TSS data calculated on raw counts with no offset. The presence of zeroes will make the log-ratio transformation impossible. In this case, you can add a 'tiny' offset using this argument.

Let's run our first PCA, using the argument logratio = 'CLR' directly in pca():

pca.result <- pca(data.filter, logratio = 'CLR')
plotIndiv(pca.result, group = Koren.16S$bodysite, title = 'My first PCA plot')


Or using the CLR data calculated with the logratio.transfo function. The results should be the same as earlier:

pca.clr <- pca(data.clr)
plotIndiv(pca.clr, group = Koren.16S$bodysite, title = 'My second PCA plot')


We can clearly see from this plot that two samples stand out. They correspond to the samples with a large library size, even after the CLR transformation.

# FAQ

## Can I apply another method (multivariate or univariate) on the CLR data?

In that case you can use our external function logratio.transfo, see ?logratio.transfo. Make sure you apply it to data.raw + 1 first: this is easier than having to add a small offset via the logratio.transfo function. The log-ratio transformation is crucial when dealing with compositional data, unless the compositional nature of the data is accounted for directly in the statistical method. See also reference [5].
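As a minimal sketch of what a univariate analysis on CLR data could look like (using a hand-rolled CLR helper on toy data rather than logratio.transfo, and a hypothetical two-group factor, for illustration only):

```r
# sketch: univariate testing on CLR-transformed data (base R only)
# clr() is a hand-rolled stand-in for logratio.transfo, for illustration
clr <- function(x) t(apply(x, 1, function(s) log(s) - mean(log(s))))

set.seed(42)
counts <- matrix(rpois(20 * 10, lambda = 5) + 1, nrow = 20)  # toy offset counts
group  <- factor(rep(c("A", "B"), each = 10))                # hypothetical groups

data.clr <- clr(counts)
pvals <- apply(data.clr, 2, function(otu) wilcox.test(otu ~ group)$p.value)
p.adjust(pvals, method = "BH")  # adjust for multiple testing
```

Any other univariate or multivariate method can be applied to data.clr in the same way, as it is simply a numeric matrix.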

## Help! I have a phyloseq object!

We explain below, step by step, how to extract the data in the right format from phyloseq. This code was written by Ms Laetitia Cardona from INRAE (thank you!).

library(phyloseq)
# load the data from the phyloseq package
data(GlobalPatterns)
?GlobalPatterns

# extraction of the taxonomy
Tax <- tax_table(GlobalPatterns)

# extract OTU table from phyloseq object
data.raw <- t(otu_table(GlobalPatterns)) # samples should be in rows and variables in columns

# offset
data.offset <- data.raw+1
sum(data.offset == 0) # ok, no zeroes left

## [1] 0

dim(data.offset)      # check dimensions

## [1]    26 19216

data.offset[1:5,1:5]  # inspect the data

## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##         549322 522457 951 244423 586076
## CL3          1      1   1      1      1
## CC1          1      1   1      1      1
## SV1          1      1   1      1      1
## M31Fcsw      1      1   1      1      1
## M11Fcsw      1      1   1      1      1

# filter using the function above
# call the function then apply on the offset data
result.filter <- low.count.removal(data.offset, percent=0.01)
data.filter <- result.filter$data.filter

# check the number of variables kept after filtering
length(result.filter$keep.otu)

## [1] 988

dim(data.filter)

## [1]  26 988

# checking the lib size per sample
lib.size <- apply(data.filter, 1, sum)
#barplot(lib.size)  # we hid this here, go ahead
#which(lib.size > 15000)

# filter the taxonomy to keep only the selected OTUs
# you will need this for outputs and further analyses
Tax.filter <- Tax[unlist(result.filter$keep.otu),]

# check that the OTUs match between files
summary(colnames(data.filter) == rownames(Tax.filter))

##    Mode    TRUE
## logical     988

# extract the sample meta data for plotting
Metadata <- data.frame(sample_data(GlobalPatterns))

# PCA
pca.result <- pca(data.filter, logratio = 'CLR')
plotIndiv(pca.result, group = Metadata$SampleType, title = 'My first PCA plot')


# when the number of variables is large, consider the argument 'cutoff'
# plotVar(pca.result, comp = c(1, 2),
#         cutoff = 0.5,
#         cex = 2,
#         title = 'PCA comp 1 - 2')

# var.names: if TRUE, shows the OTU IDs; a list shows the Class
# as stored in the taxonomy object
plotVar(pca.result, comp = c(1, 2),
        var.names = list(Tax.filter[, "Class"]),
        cutoff = 0.5,
        cex = 2,
        title = 'PCA comp 1 - 2')


## I'd like to apply a CSS normalisation instead!

CSS normalisation was developed specifically for sparse sequencing count data by Paulson et al. CSS can be considered an extension of the quantile normalisation approach: it consists of a cumulative sum up to a percentile determined with a data-driven approach. CSS corrects the bias in the assessment of differential abundance introduced by TSS and, according to the authors, would partially account for compositional data. Therefore, for CSS-normalised data no log-ratio transformation is applied, as we consider that this normalisation method does not produce compositional data per se; a simple log transformation is applied instead.

We give below the R script for CSS.

library(metagenomeSeq)

data.metagenomeSeq <- newMRexperiment(t(data.filter),
                                      featureData = NULL, libSize = NULL,
                                      normFactors = NULL)  # using the filtered data
p <- cumNormStat(data.metagenomeSeq)  # default is 0.5
data.cumnorm <- cumNorm(data.metagenomeSeq, p = p)
# data.cumnorm
data.CSS <- t(MRcounts(data.cumnorm, norm = TRUE, log = TRUE))
dim(data.CSS)  # make sure the data are in the correct format: samples in rows