# mixMC: Pre-Filtering and Normalisation

Here we use the Human Microbiome Most Diverse 16S data set as a worked example of pre-filtering and normalisation, the first step in data analysis using mixMC.

The Human Microbiome Most Diverse 16S (HMP) data set includes OTU counts on the three most diverse body sites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool, sampled from 54 unique individuals for a total of 162 samples. The pre-filtering code below is from [1,4].

The HMP data are available in mixOmics as diverse.16S:

    library(mixOmics)

    data(diverse.16S)
    data.raw = diverse.16S$data.raw # the diverse.16S raw data include an offset of 1

Make sure an offset of 1 is added to the raw data so that a log ratio transformation can be applied after the TSS normalisation.

## Prefiltering

The example below shows the pre-filtering of raw microbiome sequencing count data.

    # function to perform pre-filtering
    low.count.removal = function(
      data,          # OTU count data frame of size n (sample) x p (OTU)
      percent = 0.01 # cutoff chosen
    ){
      keep.otu = which(colSums(data) * 100 / sum(colSums(data)) > percent)
      data.filter = data[, keep.otu]
      return(list(data.filter = data.filter, keep.otu = keep.otu))
    }

    result.filter = low.count.removal(data.raw, percent = 0.01)
    data.filter = result.filter$data.filter
    length(result.filter$keep.otu) # check the number of variables kept after filtering

    ## [1] 1674


In our HMP example we started from 43,146 OTUs and ended with 1,674 OTUs after prefiltering. While this prefiltering may appear drastic, it is the default value adopted by QIIME and in other studies. The prefiltering step will avoid spurious results in the downstream statistical analysis. Feel free to increase that threshold for your own needs.

# The Details…

### At the sample level…

The first step in pre-filtering is to ensure that the number of counts is sufficiently high in each sample. We therefore removed samples that fell below an arbitrary cutoff.
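As a minimal sketch of this sample-level step, the base R code below drops samples whose total count falls under a cutoff. The toy matrix, the object names and the cutoff `min.depth` are all hypothetical and chosen only for illustration; the text does not state the cutoff used for the HMP data.

```r
# Toy count matrix: 4 samples (rows) x 3 OTUs (columns); values are made up.
counts <- matrix(c(120, 30, 50,
                     2,  1,  0,
                    80, 10, 40,
                   300, 90, 60),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:4), paste0("OTU", 1:3)))

min.depth <- 10  # hypothetical cutoff, NOT a recommended value

lib.size <- rowSums(counts)                  # total counts per sample
keep.sample <- which(lib.size >= min.depth)  # samples with enough counts
counts.kept <- counts[keep.sample, , drop = FALSE]

lib.size               # sample2 has only 3 counts in total
rownames(counts.kept)  # sample2 is removed
```

In practice the cutoff should be judged against the distribution of total counts per sample in your own data.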

### At the variable level…

Pre-filtering OTUs can improve alpha diversity measures [5] and may also counteract sequencing errors. For example, sequencing error rates on Illumina MiSeq are estimated to be ~1/1000 [6,7]. We propose to remove OTUs whose proportional counts across all samples are below 0.01%. This prefiltering step avoids spurious results in the downstream statistical analysis. The threshold used is the default value in QIIME and is also used in other microbiome studies (e.g. [9,10]).
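To make the 0.01% rule concrete, here is a small worked example in base R on a made-up count matrix. The data and object names are hypothetical. The arithmetic is the same as in the `low.count.removal` function: each OTU's total count is expressed as a percentage of all counts in the data set and compared with the threshold.

```r
# Toy count matrix: 2 samples (rows) x 3 OTUs (columns); values are made up.
counts <- matrix(c(30000, 20000, 3,
                   30000, 19996, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample1", "sample2"),
                                 c("OTU1", "OTU2", "OTU3")))

percent <- 0.01  # threshold in percent, as in the text

# Each OTU's share of all counts in the data set, in percent
otu.percent <- colSums(counts) * 100 / sum(counts)

keep.otu <- names(which(otu.percent > percent))

otu.percent  # OTU3 accounts for 4 of 100,000 counts, i.e. 0.004%
keep.otu     # OTU3 falls below 0.01% and is dropped
```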

# Normalisation

One of the characteristics of microbiome sequencing data is sparse counts, which need to be accounted for during normalisation. Two types of normalisation are considered in mixMC methodologies: Total Sum Scaling normalisation (TSS) followed by a suitable log ratio transformation, and Cumulative Sum Scaling normalisation (CSS) followed by a log transformation. Microbiome data are compositional: each sample is a vector of strictly positive components that sum to a constant (the microbial relative abundances within a sample sum to one). As a result, the data reside in a simplex rather than in Euclidean space. Using standard statistical methods on such data leads to spurious results [10], and therefore the data must be transformed.

### TSS & Log Transformation

TSS normalisation is a popular approach to accommodate varying sampling and sequencing depth. In TSS, each variable's read count is divided by the total number of read counts in the corresponding sample. TSS normalisation reflects relative information, and the resulting normalised data reside in a simplex rather than a Euclidean space, which may lead to spurious false discoveries if standard statistical methods are applied [10]. Transforming compositional data using log ratios such as the Isometric Log Ratio (ILR) or Centered Log Ratio (CLR) transformation allows us to circumvent this issue, as proposed by [1] and [2].
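The two steps can be sketched in base R as follows. This is an illustration on made-up counts rather than the mixOmics implementation, and the object names are hypothetical. An offset of 1 is added first so that the logarithm is defined for every entry.

```r
# Toy counts: 2 samples (rows) x 3 OTUs (columns); values are made up.
# The offset of 1 removes zeros before taking logs.
counts <- matrix(c(9, 19, 69,
                   4,  0, 93),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample1", "sample2"),
                                 c("OTU1", "OTU2", "OTU3"))) + 1

# TSS: divide each count by the total count of its sample (row)
tss <- counts / rowSums(counts)

# CLR: log proportions centred on their per-sample mean, which is the
# log of each proportion over the geometric mean of the sample
log.tss <- log(tss)
clr <- log.tss - rowMeans(log.tss)

rowSums(tss)  # each sample sums to 1 after TSS
rowSums(clr)  # each sample sums to 0 (up to rounding) after CLR
```

The zero row sums after CLR show that the transformed variables are still linearly dependent, which is the limitation the ILR transformation addresses.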


### CSS & Log Transformation

CSS normalisation was specifically developed for sparse microbiome count data by Paulson et al. [3,4]. CSS can be considered an extension of the quantile normalisation approach: each sample is scaled by the cumulative sum of its counts up to a percentile determined using a data-driven approach. CSS corrects the bias in the assessment of differential abundance introduced by TSS and, according to the authors, would partially account for compositional data. Therefore, for CSS normalised data, no ILR transformation is applied, as we consider that this normalisation method does not produce pure compositional data. A simple log transformation is applied.
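The idea can be sketched in base R as below. This is a simplified illustration and not the metagenomeSeq implementation: the function name `css.sketch` is hypothetical, the percentile `p` is fixed by hand instead of being chosen by the data-driven procedure, and the scaling constant of 1000 and the `log2(x + 1)` transformation are assumptions made for this example.

```r
# Simplified CSS sketch for a samples (rows) x OTUs (columns) count matrix.
# ASSUMPTIONS: fixed percentile p, scaling constant 1000, log2(x + 1).
css.sketch <- function(counts, p = 0.5, scale = 1000) {
  norm <- counts
  for (i in seq_len(nrow(counts))) {
    x <- counts[i, ]
    q <- quantile(x[x > 0], probs = p)  # percentile of the non-zero counts
    s <- sum(x[x <= q])                 # cumulative sum of counts up to it
    norm[i, ] <- x / s * scale          # scale the sample by that sum
  }
  log2(norm + 1)                        # simple log transformation
}

# Toy data: sample2 has the same profile as sample1 at twice the depth.
counts <- matrix(c(1, 2, 3, 4, 0,
                   2, 4, 6, 8, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("sample1", "sample2"), paste0("OTU", 1:5)))

css <- css.sketch(counts)
css  # both rows are identical: the depth difference has been removed
```

For real analyses, use the metagenomeSeq package, which selects the percentile from the data.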


### Transformation

If normalisation = 'TSS', then an ILR transformation follows to account for compositional data inside the pca.R function. The components and loading vectors are back transformed inside the function to a CLR space.
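The relationship between the ILR and CLR spaces can be illustrated with a short base R sketch. It assumes a Helmert-type orthonormal basis `V`, which may differ from the basis used inside mixOmics, and all object names are hypothetical. With p OTUs, ILR yields p - 1 coordinates, and multiplying by the transpose of `V` maps them back to the CLR space.

```r
# Toy proportions (already TSS normalised): 2 samples x 3 OTUs, made up.
tss <- matrix(c(0.1, 0.2, 0.7,
                0.3, 0.3, 0.4),
              nrow = 2, byrow = TRUE)
p <- ncol(tss)

# CLR: centre the log proportions within each sample
clr <- log(tss) - rowMeans(log(tss))

# Orthonormal basis from Helmert contrasts (an ASSUMED choice of basis):
# p x (p - 1), columns sum to zero and have unit length
V <- contr.helmert(p)
V <- sweep(V, 2, sqrt(colSums(V^2)), "/")

ilr <- clr %*% V          # ILR coordinates: p - 1 columns
clr.back <- ilr %*% t(V)  # back transformation to the CLR space

dim(ilr)                  # one column fewer than the CLR data
max(abs(clr.back - clr))  # numerically zero: CLR values are recovered
```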

If normalisation = 'CSS' then no further transformation is necessary (the log transformation is applied implicitly when using the metagenomeSeq package).

Note: Normalisation is data specific and needs to be carefully chosen prior to statistical analysis. We recommend using ILR transformation with PCA to overcome the CLR limitation that may lead to singular covariance matrices, and a CLR transformation with sPLS-DA.

For consistency in analysis, you can choose to use the CLR transformation for both PCA and sPLS-DA, but be aware that the results will differ. As outlined in the mixMC manuscript, analysis of the HMP most diverse data showed that both TSS and CSS normalisations identified the same bacterial families. However, in the more complex Oral case study, we observed differences: TSS+CLR led to the identification of a greater number of families than CSS.