mixMC: mixOmics for 16S Microbial Communities

mixMC [1] is a multivariate framework for microbiome data analysis that takes into account the sparsity and compositionality of microbiome data. mixMC aims to identify specific associations between microbial communities and their type of habitat, building on the hypothesis that multivariate methods can help identify microbial communities that modulate and influence biological systems as a whole.

mixMC addresses the limitations of existing multivariate methods for microbiome studies and offers distinct analytical capabilities: it handles compositional and sparse data, repeated-measures experiments and multiclass problems; it highlights important discriminative features; and it provides interpretable graphical outputs to better understand the microbial communities' contribution to each habitat.

To begin…

Load required libraries:

# install.packages("BiocManager")
# BiocManager::install("mixOmics")
library(mixOmics)

Data

In the tabs under mixMC, examples are provided applying mixMC to microbiome data sets. To download the full data sets and scripts see these links:

Non-Repeated Measures – Koren mixMC example

Repeated Measures – HMP 16S Data

Repeated Measures – Oral 16S Data

Microbiome data

Culture-independent techniques, such as shotgun metagenomics and 16S rRNA amplicon sequencing, have dramatically changed the way we can examine microbial communities. However, current statistical methods are limited in their ability to identify and compare the bacteria driving changes in their ecosystem. This is partly due to the inherent properties of microbiome data.

The absence of microbial organisms from a large number of samples results in highly skewed count data with many zeros (sparse count data). In addition, the varying sampling/sequencing depth between samples requires transformation of the count data into relative abundance (proportions), leading to compositional data.
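As a minimal illustration of these two properties, the base R sketch below uses a small made-up count table (samples in rows, taxa in columns; all names and values are hypothetical) to measure the proportion of zeros and to convert counts into relative abundances.

```r
# Toy OTU count table: 3 samples (rows) x 4 taxa (columns); values are invented
counts <- matrix(c(10, 0, 0,  90,
                    0, 5, 0,  45,
                   20, 0, 0, 180),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("sample", 1:3), paste0("OTU", 1:4)))

# Sparsity: fraction of entries that are zero
sparsity <- mean(counts == 0)

# Sequencing depth (total count) differs between samples
depth <- rowSums(counts)

# Relative abundance: each sample (row) is divided by its own depth
rel_abund <- counts / depth

# Every row now sums to one, which is what makes the data compositional
row_totals <- rowSums(rel_abund)
```

In this toy table, sample1 and sample3 have different depths but identical proportions, which shows how the conversion removes the effect of sequencing depth.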

mixMC: method improvements for microbiome 16S data analysis

Compositional data raise theoretical statistical issues and can be seriously misinterpreted when analysed with standard methods [2]: the values within a sample sum to one, so the data reside in a simplex rather than a Euclidean space. The solution proposed by several authors is to project the relative count data into a Euclidean space using log-ratio transformations, such as the centred log ratio (CLR) transformation, before applying standard statistical techniques. The CLR transformation consists of dividing each sample by the geometric mean of its values and taking the logarithm [2],[3]. The transformation is symmetric, so the dimensions of the data are retained [4].
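The CLR transformation described above can be sketched in a few lines of base R. This is an illustration only, not the mixOmics implementation: the helper `clr()`, the toy proportions and the offset value are assumptions made for the example. Because log(0) is undefined, zeros have to be replaced or offset before the transformation.

```r
# CLR of one composition: log of each part minus the mean of the logs,
# which is the same as log(part / geometric mean of the parts)
clr <- function(x) {
  log_x <- log(x)
  log_x - mean(log_x)
}

# Toy relative abundances (invented values); note the zero in sample s1
props <- matrix(c(0.1, 0.0, 0.9,
                  0.2, 0.3, 0.5),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("s1", "s2"), c("OTU1", "OTU2", "OTU3")))

# Assumed small offset to avoid log(0); the value is arbitrary
offset <- 1e-6

# Apply the CLR to each sample (row); apply() returns samples in columns, so transpose
clr_data <- t(apply(props + offset, 1, clr))

# Each CLR-transformed sample sums to zero, and no dimension is dropped
clr_sums <- rowSums(clr_data)
```

Within mixOmics, the function `logratio.transfo()` and the `logratio = "CLR"` argument of the multivariate methods are the usual way to apply this step, so a hand-written helper like the one above is only needed for illustration.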

In mixMC, the mixOmics method sPLS-DA has been improved with the CLR transformation and includes a multilevel decomposition for the repeated-measures designs that are commonly encountered in microbiome studies. The multilevel approach from [5] enables the detection of subtle differences when high inter-subject variability is present due to microbial sampling performed repeatedly on the same subjects but in multiple habitats. To account for subject variability, the data variance is decomposed into within-subject variation (due to habitat) and between-subject variation [6], similar to a within-subjects ANOVA in univariate analyses.
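The idea behind the within/between decomposition can be sketched in base R by subtracting each subject's mean profile from that subject's samples. This is a simplified illustration under assumed toy data (three hypothetical subjects, two habitats, invented values), not the exact mixOmics code.

```r
# Toy repeated-measures data: 3 subjects, each sampled in 2 habitats (values invented)
X <- matrix(c( 1.0, 2.0,
               3.0, 2.5,
               5.0, 6.0,
               7.0, 6.5,
               9.0, 1.0,
              11.0, 1.5),
            ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("OTU1", "OTU2")))
subject <- factor(rep(c("A", "B", "C"), each = 2))
habitat <- factor(rep(c("site1", "site2"), times = 3))

# Between-subject part: each subject's mean profile, repeated for its samples
subject_means <- apply(X, 2, function(col) ave(col, subject))

# Within-subject part: deviation of each sample from its subject's mean
X_within <- X - subject_means
```

In the raw values of OTU1 the differences between subjects dominate, whereas in `X_within` the same habitat shift appears for every subject. In mixOmics this decomposition is requested through the `multilevel` argument of `splsda()`, and `withinVariation()` exposes it directly.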

Graphical outputs such as plotLoadings() can be used to represent the habitat in which the selected micro-organism is most present.
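A hedged sketch of this step is given below. It assumes that mixOmics is installed and that the `Koren.16S` example data set shipped with the package is available; the values of `ncomp` and `keepX` are arbitrary choices for illustration and have not been tuned.

```r
# Runs only if the mixOmics package is installed
if (requireNamespace("mixOmics", quietly = TRUE)) {
  library(mixOmics)

  # Example data from the package: TSS-normalised OTU table and body site labels
  data("Koren.16S")
  X <- Koren.16S$data.TSS
  Y <- Koren.16S$bodysite

  # sPLS-DA with CLR transformation; ncomp and keepX are illustrative only
  res <- splsda(X, Y, ncomp = 2, keepX = c(20, 20), logratio = "CLR")

  # Each bar is a selected OTU, coloured by the habitat in which
  # its mean abundance is highest
  plotLoadings(res, comp = 1, method = "mean", contrib = "max")
}
```

In a real analysis the number of components and of selected variables would normally be chosen by cross-validation rather than fixed in advance.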

References

[1] Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169

[2] Aitchison, J., 1982. The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), pp.139-177.

[3] Fernandes, A.D., Reid, J.N., Macklaim, J.M., McMurrough, T.A., Edgell, D.R. and Gloor, G.B., 2014. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome, 2(1), p.1.

[4] Filzmoser, P., Hron, K. and Reimann, C., 2009. Principal component analysis for compositional data with outliers. Environmetrics, 20(6), pp.621-632.

[5] Westerhuis, J.A., van Velzen, E.J., Hoefsloot, H.C. and Smilde, A.K., 2010. Multivariate paired data analysis: multilevel PLSDA versus OPLSDA. Metabolomics, 6(1), pp.119-128.

[6] Liquet, B., Lê Cao, K.A., Hocini, H. and Thiébaut, R., 2012. A novel approach for biomarker selection and the integration of repeated measures experiments from two assays. BMC Bioinformatics, 13(1), p.325.