Case study: Koren diverse bodysites

Here we apply mixMC to the 16S study of Koren et al. (2011) who examined the link between oral, gut and plaque microbial communities in patients with atherosclerosis and controls. Only healthy individuals were retained in the analysis. This study contained partially repeated measures from multiple sites including 15 unique patients samples from saliva and stool, and 13 unique patients only sampled from arterial plaque samples and we therefore considered a non multilevel analysis for that experimental design (i.e. no repeated measurements). The data were downloaded from the Qiita database. After pre-filtering, the data included 973 OTU for 43 samples. Data were TSS normalized and log ratio transformations are applied in PCA and PLS-DA.

Load the latest version of mixOmics.


Data: Koren Data Set

Using the Koren data in this example we assume the data are prefiltered and normalised with TSS, as described in mixMC pre-processing steps.

# normalised data
koren.TSS= Koren.16S$data.TSS

We set up the outcome Y as a factor indicating the body sites of each sample. A repeated measures design is not nessessary as each sample is not repeated more then once.

Y  =  Koren.16S$indiv$body_site

Unsupervised Analysis with Principal Component Analysis

PCA allows for dimension reduction of the data and visualisation of diversity patterns in microbiome studies. Because we are dealing with proportions, the data are compositional and spurious results can arise if the data are not transformed into an Euclidean space. Therefore the Total Sum Scaling normalised data must be transformed using either ILR or CLR.

Here we chose the ILR transformation as advised by Filmozer et al [1]. We generally prefer the ‘CLR’ log ratio transformation as it is faster and can be used all throughout our framework (from PCA to sPLSDA), using the argument logratio = 'CLR'). We first run a PCA with a sufficiently large number of components ncomp to choose the final reduced dimension of the model.

koren.pca = pca(koren.TSS, ncomp = 10, logratio = 'CLR')

plot of chunk pca

The barplot above depicts the percentage of explained variance per component for a PCA with 10 components. It is a good indicator of how many components to retain in the final PCA model (here two components will be sufficient).

The sample plot below displays the samples colored according to Y, but note that PCA does not take Y into account in the analysis!

          comp = c(1,2), # the components to plot
          pch = 16, 
          ind.names = F, 
          group = Y, 
 = color.mixo(1:3),
          legend = TRUE,
          title = 'Koren, PCA comp 1 - 2')

plot of chunk plotIndiv PCA

Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA

The mixMC frameworks uses the sPLS-DA multivariate analysis from mixOmics[3]. The sPLS-DA selects the most discriminative OTUs that best characterise each body site, see sPLS-DA.

Note that with PLS-DA and sPLS-DA we can only choose a CLR transformation (see details here.

We first run the perf function with a PLS-DA model with no variable selection. Often ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but it depends on the data. We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).

Here we assess the performance of the PLSDA on 3 components.

koren.plsda = plsda(X = koren.TSS, Y, ncomp = nlevels(Y), logratio = 'CLR')

koren.perf.plsda = perf(koren.plsda, validation = 'Mfold', folds = 5,
                    progressBar = FALSE, nrepeat = 10)

plot(koren.perf.plsda, overlay = 'measure', sd=TRUE)

plot of chunk plsda

The plot indicates a decrease in the classification error rate (i.e. an increase in classification performance) from one component to 3 components in the model. The BER stands for Balanced Error Rate and should be considered when we have an unbalanced number of samples per group. Here the number of samples per group is very similar, which explains why both overall and BER are overlapping. The performance reaches its best for 3 components, which suggest ncomp = 3 for a final PLSDA model. Note that for the sparse PLSDA we may obtain a different ncomp.

For more details on the prediction distances, refer to Suppl info in Rohart et al. 2017 [4]. Here all distances are similar, we choose 'max.dist' for sPLSDA.

#First two components
plotIndiv(koren.plsda , comp = c(1,2),
          group = Y, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE, title = 'Koren, PLSDA comp 1 - 2')

plot of chunk plot plsda

It is possible that the third component may not visually add much to the model and could be ignored.

Tuning sPLS-DA

The parameters to choose in sPLS-DA are the number of variables to select (keepX) and the number of components (ncomp). To do this we use the function tune.splsda. In this example the sPLS-DA tuning step is performed on 3 components, using 5-fold validation repeated 10 times.

tune.splsda needs to be performed prior to the sPLS-DA analysis to choose the parameters on a grid of keepX values (make sure you choose the appropriate M fold cross-validation and provide sufficient nrepeat in the evaluation model, except for ‘loo’ where it can only be run on 1 repeat), also check the stability of the features selected during the cross-validation process.

# this chunk takes ~2 min to run with 6.3.0
set.seed(33)  # for reproducible results for this code
koren.tune.splsda = tune.splsda(koren.TSS, 
                          Y = Y, 
                          ncomp = 3, 
                          multilevel = NULL, 
                          logratio = 'CLR',
                          test.keepX = c(seq(5,150, 5)), 
                          validation = c('Mfold'), 
                          folds = 5, 
                          dist = 'max.dist', # prediction distance can be chosen according to tune.plsda results
                          nrepeat = 10,
                          progressBar = FALSE)
# may show some convergence issues for some of the cases, it is ok for tuning
comp1 comp2 comp3
5 0.3822222 0.0177778 0.0940171
10 0.3688889 0.0111111 0.0851282
15 0.3644444 0.0044444 0.0829060
20 0.3511111 0.0022222 0.0717949
25 0.3400000 0.0022222 0.0695726
30 0.3333333 0.0000000 0.0673504

plot of chunk unnamed-chunk-4

The graphic above shows that the error rate decreases when 2 components are included in sPLS-DA, wherea the third component seems to add noise. The diamonds indicate the optimal keepX variables to select on each component based on the balanced error rate.

# optimal number of variables to select on 2 comps:
select.keepX = koren.tune.splsda$choice.keepX[1:2]
## comp1 comp2 
##   130    30


We now run a classic sPLS-DA.

Note: if you have not tuned your sPLSDA model (or you are unhappy with the size of the selection) you can set the optimal keepX manually, e.g.:

# select.keepX = c(150, 15) # manually choose size of selection
# the sPLS-DA
koren.splsda = splsda(X = koren.TSS,  Y = Y, ncomp = 2, keepX = select.keepX, logratio= "CLR")

sPLS-DA plots

The sample plot below shows the sPLS-DA first two components. The ellipse are 0.95 confidence interval ellipse for each body site.

          ind.names = F, 
 = color.mixo(1:3), 
          comp = c(1,2),
          pch = 16, 
          ellipse = TRUE,
          legend = TRUE,
          title = 'Koren, sPLS-DA comp 1-2')

plot of chunk splsda-plot

Evaluating sPLS-DA

The classification performance of the sPLS-DA multilevel model can be assessed using the function perf. The mean error rates per component are output and type of distance are output. Here do not hesitate to increase the number of repeats for accurate estimations.

set.seed(34)  # for reproducible results for this code

koren.perf.splsda = perf(koren.splsda, validation = 'Mfold', folds = 5, 
                   progressBar = FALSE, nrepeat = 10, dist = 'max.dist')

The mean overall and Balanced Error Rate are output for each component.

## $overall
##         max.dist
## comp 1 0.3209302
## comp 2 0.0000000
## $BER
##         max.dist
## comp 1 0.3446154
## comp 2 0.0000000

plot of chunk unnamed-chunk-8

OTU selection and plots

The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The contribution plots below display the abundance of each OTU (large abundance = large absolute value) and in which body site they are the most abundant for each sPLS-DA component. The contribution plots need to be interpreted in combination with the sample plot below to understand the similarities between body sites, in addition to answer the question 'which bacteria characterise those body sites?'

The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component. Consider the absolute value as an indication of the importance of the OTU in the microbial signature, while the sign generally indicates positive / negative correlations between the OTUs, relatively to the proportions of the others (see also with a plotVar output).

For example on comp 2:

head(selectVar(koren.splsda, comp = 2)$value) 
##          value.var
## 406145  -0.4152795
## 370772  -0.3480359
## 1089121  0.2826586
## 2356875 -0.2800504
## 388506  -0.2681228
## 925707   0.2532633

To this list of selected OTU displayed from the most important to the least important we can combine their stability measure from the perf output (i.e. how often were they selected across the different CV runs).

selected.OTU.comp2 = selectVar(koren.splsda, comp = 2)$name
# stability of OTUs selected on comp 2
Var1 Freq
406145 0.98
370772 0.98
1089121 1.00
2356875 0.98
388506 0.98
925707 1.00
4467774 0.98
3057707 0.98
4304528 1.00
824705 0.98
2360704 0.98
4383961 0.96
971038 1.00
2221284 0.92
802262 0.96

The plotLoadings shows that all OTUs selected on the first component are mostly highly abundant in faecal (based on their mean per body site). Let's display only the top 50 selected OTUs

plotLoadings(koren.splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1, ndisplay = 50, = 0.5, size.legend = 0.3)

plot of chunk unnamed-chunk-12

The plotLoadings shows that all OTUs selected on the second component are highly abundant in the tow other bodysites. The sign indicates the opposition between body sites.

plotLoadings(koren.splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1, = 0.5, size.legend = 0.3)

plot of chunk unnamed-chunk-13

We could improve the plot by showing only the top OTUs, and with their names at the Family level, followed by their OTU ID

name.var = paste(Koren.16S$taxonomy[, 'Family'], colnames(koren.TSS), sep = '|')

plotLoadings(koren.splsda, comp = 2, method = 'mean', contrib = 'max', name.var = name.var, size.title = 1, = 0.5, size.legend = 0.3)

plot of chunk unnamed-chunk-14

Clustered Image Map

A heatmap will also help understanding the microbial signature.
We represent clustered image maps (with Euclidian distance, Ward linkage set by default) for the OTUs selected on each sPLS-DA component. The abundance values that are displayed are the normalised, log ratio transformed values. All OTUs selected by the sPLS-DA model are displayed, other options can include a specific component, or a specific cutoff of 'association', see ?cim.

#CIM plot
cim(koren.splsda, row.sideColors = color.mixo(Y))

plot of chunk CIM


  1. Koren, O., Knights, D., Gonzalez, A., Waldron, L., Segata, N., Knight, R., Huttenhower, C. and Ley, R.E., 2013. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol, 9(1), p.e1002863

  2. Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169

  3. Lê Cao, K.-A., Boitard, S., Besse, P.: Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics 12(1), 253 (2011)

  4. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration.