Here we use the Human Microbiome Most Diverse 16S data set as a worked example for Prefiltering and Normalisation as the first step in data analysis using our mixOmics framework for microbial communities analysis mixMC.

The Human Microbiome Most Diverse 16S (HMP) data set includes OTU counts on the three most diverse bodysites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool sampled from 54 unique individuals for a total of 162 samples.

The HMP data set is directly available from the mixOmics package:

data.raw = diverse.16S$data.raw 

Here the raw data include an offset of 1. The Total Sum Scaling normalisation is OK with 0 data, but not the log ratio transformation. Therefore we set data.raw = data.raw+1. Note that the offset will not circumvent the zero values issue, as after log ratio transformation we will still have zero values.

You can check that the offset was applied:

sum(which(data.raw == 0))
## [1] 0


We use a prefiltering step to remove OTUs for which the sum of counts are below a certain threshold compared to the total sum of all counts. The function to prefilter is given below, and was adapted from Arumugam et al. (2011) [0].

# function to perform pre-filtering
low.count.removal = function(
                        data, # OTU count data frame of size n (sample) x p (OTU)
                        percent=0.01 # cutoff chosen
    keep.otu = which(colSums(data)*100/(sum(colSums(data))) > percent)
    data.filter = data[,keep.otu]
    return(list(data.filter = data.filter, keep.otu = keep.otu))
result.filter = low.count.removal(data.raw, percent=0.01)
data.filter = result.filter$data.filter
length(result.filter$keep.otu) # check the number of variables kept after filtering
## [1] 1674

In our HMP example we started from 43,146 OTUs and ended with 1,674 OTUs after prefiltering. While this prefiltering may appear drastic (and is highly dependent on the bioinformatics steps performed beforehand such as OTU picking), it will avoid spurious results in the downstream statistical analysis. Feel free to increase that threshold for your own needs, or not use it at all if you prefer.

Check your library size

At this stage you may want to ensure that the number of counts for each sample is relatively' similar and that there is no obvious outlier sample. Those samples may also appear as outliers in PCA plots downstream. As for any sample outlier removal, ensure that you have a good reason to remove them!


Because of uneven sequencing depths, library sizes often differ from one sample to another. Two types of scaling / normalisation currently exist to accommodate for library size:

  • Total Sum Scaling normalisation (TSS) which needs to be followed by log ratio transformation, see below

  • Cumulative Sum Scaling normalisation (CSS) followed by log transformation.

TSS results in compositional data (or proportions) that are restricted to a space where the sum of all OTU proportions for a given sample sums to 1. Using standard statistical methods on such data may lead to spurious results [1] and therefore the data must be further transformed.


TSS normalisation is a popular approach to accommodate for varying sampling and sequencing depth. In TSS the variable read count is divided by the total number of read counts in each individual sample.

Here is a simple function to TSS, the raw data (with an offset of 1) have samples in rows and OTUs in columns.

# each variable read count is divided by the total number of read counts
TSS.divide = function(x){
# function is applied to each row (i.e. each sample)
data.TSS = t(apply(data.filter, 1, TSS.divide))

TSS normalisation reflects relative information, and the resulting normalised data reside in a simplex (bounded) rather than an Euclidian space. Transforming compositional data using log ratios such as Isometric Log Ratio (ILR) or Centered Log Ratio transformation (CLR) allows us to circumvent this issue as proposed by [1] and [2].

Log ratio transformation

ILR and CLR transformations are implemented directly into our multivariate methods pca, plsda and splsda, for example (see the Tab for the Koren data analysis):

koren.pca = pca(koren.TSS, ncomp = 10, logratio = 'CLR')

koren.pca = pca(koren.TSS, ncomp = 10, logratio = 'ILR')

Which log ratio transformation to use?

According to Filmozer et al [2], ILR transformation is best for a PCA analysis. However, for a PLS-DA and sPLS-DA analysis, logratio = 'CLR' is necessary for OTU selection (see details in Lê Cao et al. 2016 [3]). Generally speaking, a PCA with either TSS+CLR or TSS+ILR may not make much differences visually, but the transformed data will be different. We generally prefer the ‘CLR’ log ratio transformation as it is faster and can be used consistent throughout our mixMC framework (from PCA to sPLSDA).

What if I want to apply another method (multivariate or univariate) on the ILR or CLR data?

In that case you can use our external function logratio.transfo, see ?logratio.transfo. Make sure you apply it to the TSS(data.raw +1) first, it will be easier than having to add a small offset by using the loratio.transfo function. The log ratio transformation is crucial when dealing with proportional data!, unless the compositional nature of the data is accounted for directly in the statistical methods. See also the very good reference form Pawlowsky-Glahn et al. 2015 [4].

CSS & Log Transformation

CSS normalisation was specifically developed for sparse sequencing count data by Paulson et al., [5, 6]. CSS can be considered as an extension of the quantile normalisation approach and consists of cumulative sum up to a percentile determined using a data-driven approach. CSS corrects the bias in the assessment of differential abundance introduced by TSS and, according to the authors, would partially account for compositional data. Therefore, for CSS normalised data, no ILR transformation is applied as we consider that this normalisation method does not produce compositional data per say. A simple log transformation is then applied.

We give below the R script for CSS.

# source("")
# biocLite("metagenomeSeq")

data.metagenomeSeq = newMRexperiment(t(data.filter), #phenoData=phenotypeData, 
                      featureData=NULL, libSize=NULL, normFactors=NULL) #using filtered data 
p = cumNormStat(data.metagenomeSeq) #default is 0.5
data.cumnorm = cumNorm(data.metagenomeSeq, p=p)
data.CSS = t(MRcounts(data.cumnorm, norm=TRUE, log=TRUE)) 
dim(data.CSS)  # make sure the data are in a correct formal: number of samples in rows
## [1]  162 1674

*Note however that in our mixMC examples presented here we favoured the TSS normalisation. *.


  1. Arumugam M., Raes J., Pelletier E., Le Paslier D., Yamada T., Mende D.R., et al.: Enterotypes of the human gut microbiome. Nature 473 (7346), 174–180 (2011)

  2. Aitchison, J.: The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 139-177 (1982)

  3. Filzmoser, P., Hron, K., Reimann, C.: Principal component analysis for compositional data with outliers. Environmetrics 20(6), 621–632 (2009)

  4. Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169

  5. Pawlowsky-Glahn V, Egozcue J, Tolosana-Delgado R (2015) Modeling and Analysis of Compositional Data, Wiley

  6. Paulson, J.N., Stine, O.C.,Bravo, H.C., Pop, M.: Differential abundance analysis for microbial marker-gene surveys. Nature methods 10(12), 1200–1202 (2013)

  7. Paulson, J.N., Pop, M., Bravo, H.C.: metagenomeSeq : Statistical analysis for sparse high-throughput sequencing. Bioconductor package: 1.6.0. (2015).