Case study: Koren diverse bodysites

Note: this case study is quite detailed for illustrative purposes. You can skip some parts for your analyses. For more details about the methods and performance measures we use, have a look at our list of references, and any of the other tabs focusing on specific methods.

Here we apply mixMC to the 16S study of Koren et al. (2011) [1] who examined the link between oral, gut and plaque microbial communities in patients with atherosclerosis and controls. Only healthy individuals were retained in the analysis. This study contains measures from different bodysites, saliva and stool and arterial plaque. The design of the experiment is partially repeated (15 unique patients for saliva and stool and 13 for arterial plaque). For simplicity here we consider the experiment as non-repeated. See the HMP example in the Tab for a repeated measures analysis.

The data were downloaded from the Qiita database. After pre-filtering, the data included 980 OTU for 43 samples. Data were TSS normalized and log ratio transformations are applied in PCA and PLS-DA.


Using the Koren data in this example, we assume the data are offset and pre-filtered, as described in mixMC pre-processing steps.

# offset data, filtered
data = Koren.16S$data.raw

We set up the outcome Y as a factor indicating the body sites of each sample.

Y  =  Koren.16S$bodysite

Unsupervised Analysis with PCA

PCA allows for dimension reduction of the data and visualisation of diversity patterns in microbiome studies. Because we are dealing with compositional data, spurious results can arise if the data are not transformed into an Euclidean space. Therefore, we use CLR transformation , using the argument logratio = 'CLR' in our functions.

We first run a PCA with a sufficiently large number of components ncomp to choose the final reduced dimension of the model.

koren.pca = pca(data, ncomp = 10, logratio = 'CLR')

plot of chunk pca

The barplot above depicts the percentage of explained variance per component for a PCA with 10 components. It is a good indicator of how many components to retain in the final PCA model (here two components will be sufficient).

The sample plot below displays the samples coloured according to Y, but note that PCA does not take Y into account in the analysis!

          comp = c(1,2), # the components to plot
          ind.names = F, # not showing sample names
          group = Y, # color according to Y
          legend = TRUE,
          title = 'Koren, PCA comp 1 - 2')

plot of chunk plotIndiv PCA

Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA

The mixMC framework uses the sPLS-DA multivariate analysis from mixOmics [3]. The sPLS-DA selects the most discriminative OTUs that best characterise each body site, see sPLS-DA.

Note that with PLS-DA and sPLS-DA we can only choose a CLR transformation (see details here.

We first run the perf function with a PLS-DA model with no variable selection. Often ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but it depends on the data. We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).

Here we assess the performance of the PLS-DA on 3 components.

koren.plsda = plsda(X = data, Y, ncomp = nlevels(Y), logratio = 'CLR')

koren.perf.plsda = perf(koren.plsda, validation = 'Mfold', folds = 5,
                    progressBar = FALSE, nrepeat = 10)

plot(koren.perf.plsda, overlay = 'measure', sd=TRUE)

plot of chunk plsda

The plot indicates a decrease in the classification error rate (i.e. an increase in classification performance) from one component to 3 components in the model. The BER stands for Balanced Error Rate and should be considered when there is an unbalanced number of samples per group. Here the number of samples per group is very similar, which explains why both overall and BER overlap. The performance reaches its best for 3 components, which suggests ncomp = 3 for a final PLS-DA model. Note that for the sparse PLSDA we may obtain a different ncomp.

For more details on the prediction distances, refer to Supplementary information in Rohart et al. 2017 [4]. Here all distances are similar, we choose 'max.dist' for sPLS-DA.

#First two components
plotIndiv(koren.plsda , comp = c(1,2),
          group = Y, ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = 'Koren, PLSDA comp 1 - 2')

plot of chunk plot plsda

It is possible that the third component may not visually add much to the model and could be ignored.

Tuning sPLS-DA

The parameters to choose in sPLS-DA are the number of variables to select (keepX) and the number of components (ncomp). To do this we use the function tune.splsda. In this example the sPLS-DA tuning step is performed on 3 components, using 5-fold validation repeated 10 times.

tune.splsda needs to be performed prior to the sPLS-DA analysis to choose the parameters on a grid of keepX values (make sure you choose the appropriate M fold cross-validation and provide sufficient nrepeat in the evaluation model, except for loo where it can only be run on 1 repeat). Also check the stability of the features selected during the cross-validation process.

# first, set a grid of values to test:
grid.keepX = c(seq(5,150, 5))  
# if you dont understand what this means, type
# grid.keepX  # adjust this grid as necessary for your own data

# this chunk takes ~2 min to run
set.seed(33)  # for reproducible results for this code, remove for your own code
koren.tune.splsda = tune.splsda(data,
                          Y = Y,
                          ncomp = 3,
                          logratio = 'CLR',
                          test.keepX = grid.keepX,
                          validation = c('Mfold'),
                          folds = 5,
                          dist = 'max.dist', # prediction distance can be chosen according to tune.plsda results
                          nrepeat = 10,
                          progressBar = FALSE)
# may show some convergence issues for some of the cases, it is ok for tuning
comp1 comp2 comp3
5 0.3888889 0.0088889 0.0666667
10 0.3600000 0.0066667 0.0622222
15 0.3511111 0.0044444 0.0644444
20 0.3533333 0.0022222 0.0622222
25 0.3400000 0.0022222 0.0622222
30 0.3323077 0.0022222 0.0555556

plot of chunk unnamed-chunk-3

The graphic above shows that the error rate decreases when 2 components are included in the sPLS-DA, whereas the third component seems to add noise. The diamonds indicate the optimal keepX variables to select on each component based on the balanced error rate.

# optimal number of variables to select on 2 comps:
select.keepX = koren.tune.splsda$choice.keepX[1:2]
## comp1 comp2 
##   130    40


We now run a classic sPLS-DA.

Note: if you have not tuned your sPLS-DA model (or you are unhappy with the size of the selection) you can set the optimal keepX manually, e.g.:

# select.keepX = c(50, 65) # manually choose size of selection
# the sPLS-DA
koren.splsda = splsda(X = data,  Y = Y, ncomp = 2, keepX = select.keepX, logratio= "CLR")

sPLS-DA plots

The sample plot below shows the first two components for the sPLS-DA. The ellipse are 0.95 confidence interval ellipse for each body site.

          comp = c(1,2),
          ind.names = F,
          ellipse = TRUE,
          legend = TRUE,
          title = 'Koren, sPLS-DA comp 1-2')

plot of chunk splsda-plot

Evaluating sPLS-DA

The classification performance of the sPLS-DA multilevel model can be assessed using the function perf. The mean error rates per component and the type of distance are output. It can be beneficial to increase the number of repeats for more accurate estimations.

set.seed(34)  # for reproducible results for this code, remove for your own code

koren.perf.splsda = perf(koren.splsda, validation = 'Mfold', folds = 5,
                   progressBar = FALSE, nrepeat = 10, dist = 'max.dist')

The mean overall and Balanced Error Rate are output for each component.

## $overall
##        max.dist
## comp1 0.3046512
## comp2 0.0000000
## $BER
##        max.dist
## comp1 0.3297436
## comp2 0.0000000

plot of chunk unnamed-chunk-7

OTU selection and plots

The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The contribution plots below display the abundance of each OTU (large abundance = large absolute value) and in which body site they are the most abundant for each sPLS-DA component. The contribution plots need to be interpreted in combination with the sample plot below to understand the similarities between body sites, in addition to answer the question “Which bacteria characterise those body sites?”

The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component. Consider the absolute value as an indication of the importance of the OTU in the microbial signature, while the sign generally indicates positive or negative correlations between the OTUs, relative to the proportions of the others (see also with a plotVar output).

For example on comp 2:

head(selectVar(koren.splsda, comp = 2)$value)
##          value.var
## 406145  -0.3725344
## 370772  -0.3191646
## 1089121  0.2676551
## 2356875 -0.2651746
## 388506  -0.2557496
## 925707   0.2444380

To this list of selected OTUs displayed from the most important to the least important, we can combine their stability measure from the perf output (i.e. how often were they selected across the different CV runs).

selected.OTU.comp2 = selectVar(koren.splsda, comp = 2)$name
# stability of OTUs selected on comp 2
Var1 Freq
406145 0.96
370772 0.96
1089121 1.00
2356875 0.96
388506 0.96
925707 1.00
4467774 0.96
3057707 0.96
4304528 1.00
824705 0.96
2360704 0.96
4383961 0.96
971038 1.00
2221284 0.96
802262 1.00

The plotLoadings plot shows that all OTUs selected on the first component are mostly highly abundant in faecal samples (based on their mean per body site). Let's display only the top 50 selected OTUs:

plotLoadings(koren.splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1, ndisplay = 50, = 0.5, size.legend = 0.3)

plot of chunk unnamed-chunk-11

The plotLoadings plot shows that all OTUs selected on the second component are highly abundant in the two other body sites. The sign indicates the opposition between body sites.

plotLoadings(koren.splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1, = 0.5, size.legend = 0.3)

plot of chunk unnamed-chunk-12

We could improve the plot by showing only the top OTUs, and with their names at the Family level, followed by their OTU ID:

name.var = paste(Koren.16S$taxonomy[, 'Family'], colnames(data), sep = '|')

plotLoadings(koren.splsda, comp = 2, method = 'mean', contrib = 'max', name.var = name.var, size.title = 1, = 0.5, size.legend = 0.3)

plot of chunk unnamed-chunk-13

Clustered Image Map

A heatmap will also help to understand the microbial signature. We represent Clustered Image Maps (with Euclidian distance, Ward linkage set by default) for the OTUs selected on each sPLS-DA component. The abundance values that are displayed are the normalised, log ratio transformed values. All OTUs selected by the sPLS-DA model are displayed, other options can include a specific component, or a specific cutoff of 'association', see ?cim.

Note, if the margins are too large, use either X11() prior, or use the save and in the cim function.

#CIM plot
cim(koren.splsda, row.sideColors = color.mixo(Y))

plot of chunk CIM


  1. Koren, O., Knights, D., Gonzalez, A., Waldron, L., Segata, N., Knight, R., Huttenhower, C. and Ley, R.E., 2013. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol, 9(1), p.e1002863

  2. Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169

  3. Lê Cao, K.-A., Boitard, S., Besse, P.: Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics 12(1), 253 (2011)

  4. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration.