mixMC: Pre-Filtering and Normalisation

Here we use the Human Microbiome Most Diverse 16S data set as a worked example of pre-filtering and normalisation, the first step in data analysis with mixMC.

The Human Microbiome Most Diverse 16S (HMP) data set includes OTU counts for the three most diverse body sites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool, sampled from 54 unique individuals for a total of 162 samples. The pre-filtering code below is from [1,4].

The HMP data are available in mixOmics as diverse.16S:

data.raw = diverse.16S$data.raw # the diverse.16S raw data include an offset of 1

Make sure that an offset of 1 is added to the raw data so that a log ratio transformation can be applied after TSS normalisation.


The example below shows how to pre-filter raw microbiome sequencing counts.

# function to perform pre-filtering
low.count.removal = function(
                        data, # OTU count data frame of size n (sample) x p (OTU)
                        percent=0.01 # cutoff chosen, in percent of the total counts
                        ){
    # keep the OTUs whose share of all counts across all samples exceeds the cutoff
    keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
    data.filter = data[,keep.otu]
    return(list(data.filter = data.filter, keep.otu = keep.otu))
}

result.filter = low.count.removal(data.raw, percent=0.01)
data.filter = result.filter$data.filter
length(result.filter$keep.otu) # check the number of variables kept after filtering
## [1] 1674

In our HMP example we started from 43,146 OTUs and ended with 1,674 OTUs after pre-filtering. While this pre-filtering may appear drastic, the 0.01% threshold is the default value adopted by QIIME and is used in other studies. The pre-filtering step avoids spurious results in the downstream statistical analysis. You can increase that threshold to suit your own needs.

The Details…

At the sample level…

The first step in pre-filtering is to ensure that the number of counts is sufficiently high in each sample. We therefore removed samples that fell below an arbitrary cutoff.
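As a minimal sketch of this sample-level filter, the toy example below removes samples whose total count falls below a cutoff. The matrix `counts`, the cutoff `min.depth` and its value are hypothetical and chosen for illustration only; the cutoff used for the HMP data is not stated here.

```r
# toy OTU count matrix: 4 samples (rows) x 3 OTUs (columns)
counts = matrix(c(120, 30, 50,
                    2,  1,  0,
                  300, 80, 20,
                   10,  5,  3),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("sample", 1:4), paste0("OTU", 1:3)))

min.depth = 100 # hypothetical cutoff, to be chosen for your own data

lib.size = rowSums(counts)                 # total number of counts per sample
keep.sample = which(lib.size >= min.depth) # samples with enough counts
counts.kept = counts[keep.sample, , drop = FALSE]
```

Inspecting `lib.size` before choosing the cutoff (for example with a histogram) is a reasonable way to see which samples would be lost.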

At the variable level…

Pre-filtering OTUs can improve alpha diversity measures [5] and may also counteract sequencing errors. For example, the sequencing error rate of Illumina MiSeq is estimated to be ~ 1/1000 [6,7]. We propose to remove OTUs whose proportional counts across all samples are below 0.01%. This pre-filtering step avoids spurious results in the downstream statistical analysis. The threshold used is the default value in QIIME and is also used in other microbiome studies (e.g. [9,10]).
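To make the 0.01% rule concrete, here is a small worked sketch on a toy matrix. The object names and the counts are made up for illustration; only the rule itself (an OTU's share of all counts, in percent, compared with the cutoff) comes from the text above.

```r
# toy OTU count matrix: 3 samples (rows) x 4 OTUs (columns)
counts = matrix(c(5000, 3000, 1, 1999,
                  6000, 2500, 0, 1500,
                  4000, 4000, 0, 2000),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("sample", 1:3), paste0("OTU", 1:4)))

percent = 0.01 # cutoff, in percent of the total counts

# share of each OTU in the total counts across all samples, in percent
otu.percent = colSums(counts) * 100 / sum(counts)

keep.otu = which(otu.percent > percent)
counts.filter = counts[, keep.otu, drop = FALSE]
```

In this toy data OTU3 has a single count out of 30,000 (about 0.003%), which is below the cutoff, so it is the only OTU removed.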


One of the characteristics of microbiome sequencing data is sparse counts, which need to be accounted for during normalisation. Two types of normalisation are considered in the mixMC methodology: Total Sum Scaling normalisation (TSS) followed by a suitable log ratio transformation, and Cumulative Sum Scaling normalisation (CSS) followed by a log transformation. Microbiome data are compositional: each sample is a vector of strictly positive components that sum to a constant (microbial relative abundances within a sample sum to one). Such data therefore reside in a simplex rather than in Euclidean space. Applying standard statistical methods to them leads to spurious results [10], so the data must be transformed.

TSS & Log Transformation

TSS normalisation is a popular approach to account for varying sampling and sequencing depth. In TSS, each variable read count is divided by the total number of read counts in the corresponding sample. TSS normalisation reflects relative information, and the resulting normalised data reside in a simplex rather than in a Euclidean space, which may lead to spurious false discoveries if standard statistical methods are applied [10]. Transforming compositional data using log ratios such as the Isometric Log Ratio (ILR) or Centered Log Ratio (CLR) transformation allows us to circumvent this issue, as proposed by [1] and [2].
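The two steps can be sketched in base R as follows. This is an illustration on a toy matrix with made-up counts, not the mixOmics implementation; the object names are hypothetical and the counts are assumed to already include the offset of 1, so that no zero enters the logarithm.

```r
# toy count matrix, offset of 1 already added: 3 samples (rows) x 3 OTUs (columns)
counts = matrix(c(10, 20, 70,
                   5,  5, 90,
                  30, 30, 40),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("sample", 1:3), paste0("OTU", 1:3)))

# TSS: divide each count by the total count of its sample (row)
tss = counts / rowSums(counts)

# CLR: log proportion minus the mean log proportion of the sample
log.tss = log(tss)
clr = log.tss - rowMeans(log.tss)
```

After TSS each row sums to one, and after CLR each row sums to zero. mixOmics provides a `logratio.transfo` function for log ratio transformations; check its documentation for the exact arguments before relying on it.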


CSS & Log Transformation

CSS normalisation was specifically developed for sparse microbiome count data by Paulson et al. [3, 4]. CSS can be considered an extension of the quantile normalisation approach: each count is scaled by the cumulative sum of the sample's counts up to a percentile that is determined using a data-driven approach. CSS corrects the bias in the assessment of differential abundance introduced by TSS and, according to the authors, partially accounts for the compositional nature of the data. Therefore, for CSS normalised data, no ILR transformation is applied, as we consider that this normalisation method does not produce purely compositional data. A simple log transformation is applied instead.
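The idea of a cumulative sum up to a percentile can be illustrated with a simplified base R sketch. It is not the metagenomeSeq implementation: the percentile `p` is fixed here by assumption (metagenomeSeq chooses it from the data), and the scaling constant of 1000 and the `log2(x + 1)` transform are likewise assumptions made for this illustration. All object names are hypothetical.

```r
# toy OTU count matrix: 2 samples (rows) x 5 OTUs (columns)
counts = matrix(c(1,  2,  3, 4, 100,
                  0, 10, 20, 0,  30),
                nrow = 2, byrow = TRUE,
                dimnames = list(paste0("sample", 1:2), paste0("OTU", 1:5)))

p = 0.5 # assumed fixed percentile (data-driven in metagenomeSeq)

# per-sample scaling factor: sum of the counts that do not exceed
# the p-th percentile of the sample's non-zero counts
css.factor = apply(counts, 1, function(x) {
    q = quantile(x[x > 0], probs = p)
    sum(x[x <= q])
})

css = counts / css.factor * 1000 # scaled counts (assumed constant of 1000)
log.css = log2(css + 1)          # simple log transformation
```

In sample1 the dominant OTU5 (100 counts) does not enter the scaling factor (1 + 2 + 3 = 6), whereas under TSS it would make up most of the denominator.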



If normalisation = 'TSS', then an ILR transformation follows inside the pca.R function to account for compositional data. The components and loading vectors are back-transformed to the CLR space inside the function.

If normalisation = 'CSS', then no further transformation is necessary (the log transformation is applied implicitly when using the metagenomeSeq package).

Note: Normalisation is data specific and needs to be carefully chosen prior to statistical analysis. We recommend using ILR transformation with PCA to overcome the CLR limitation that may lead to singular covariance matrices, and a CLR transformation with sPLS-DA.
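The CLR limitation mentioned in the note can be illustrated with a short base R sketch on simulated counts. The data and object names are made up for illustration. Because every CLR-transformed sample sums to zero, the columns are linearly dependent and the covariance matrix has an eigenvalue that is numerically zero.

```r
set.seed(1)

# simulated strictly positive counts: 20 samples (rows) x 4 OTUs (columns)
counts = matrix(rpois(20 * 4, lambda = 50) + 1, nrow = 20)

# TSS followed by CLR
tss = counts / rowSums(counts)
log.tss = log(tss)
clr = log.tss - rowMeans(log.tss)

# covariance matrix of the CLR data and its smallest eigenvalue (in absolute value)
clr.cov = cov(clr)
min.eigen = min(abs(eigen(clr.cov, symmetric = TRUE)$values))
```

Here `min.eigen` is zero up to rounding error, so `clr.cov` cannot be inverted. The ILR transformation avoids this by mapping the p parts to p - 1 coordinates.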

For consistency, you can choose to use the CLR transformation for both PCA and sPLS-DA, but be aware that the results will differ. As outlined in the mixMC manuscript, on the HMP most diverse data both TSS and CSS normalisations identified the same bacterial families. However, in the more complex Oral case study we observed differences, as TSS+CLR led to the identification of a greater number of families than CSS.


  1. Aitchison, J.: The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 139–177 (1982)
  2. Filzmoser, P., Hron, K., Reimann, C.: Principal component analysis for compositional data with outliers. Environmetrics 20(6), 621–632 (2009)
  3. Paulson, J.N., Stine, O.C., Bravo, H.C., Pop, M.: Differential abundance analysis for microbial marker-gene surveys. Nature Methods 10(12), 1200–1202 (2013)
  4. Paulson, J.N., Pop, M., Bravo, H.C.: metagenomeSeq: Statistical analysis for sparse high-throughput sequencing. Bioconductor package: 1.6.0 (2015)
  5. Bokulich, N.A., Subramanian, S., Faith, J.J., Gevers, D., Gordon, J.I., Knight, R., Mills, D.A., Caporaso, J.G.: Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods 10(1), 57–59 (2013)
  6. Kunin, V., Engelbrektson, A., Ochman, H., Hugenholtz, P.: Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environmental Microbiology 12(1), 118–123 (2010)
  7. Huse, S.M., Welch, D.M., Morrison, H.G., Sogin, M.L.: Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environmental Microbiology 12(7), 1889–1898 (2010)
  8. Knights, D., Parfrey, L.W., Zaneveld, J., Lozupone, C., Knight, R.: Human-associated microbial signatures: examining their predictive value. Cell Host & Microbe 10(4), 292–296 (2011)
  9. Arumugam, M., Raes, J., Pelletier, E., Le Paslier, D., Yamada, T., Mende, D.R., et al.: Enterotypes of the human gut microbiome. Nature 473(7346), 174–180 (2011)
  10. Mandal, S., Van Treuren, W., White, R.A., Eggesbø, M., Knight, R., Peddada, S.D.: Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial Ecology in Health and Disease 26 (2015)