Case study: HMP bodysites repeated measures

Note: this case study is quite detailed for illustrative purposes. You can skip some parts for your analyses. For more details about the methods and performance measures we use, have a look at our list of references, and any of the other tabs focusing on specific methods.

Here we analyse a subset of the Human Microbiome Most Diverse 16S data set, which includes OTU counts on the three most diverse body sites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool, sampled from 54 unique individuals for a total of 162 samples. The pre-filtered data set includes 1,674 OTU counts. The data were downloaded from the Human Microbiome Project. The original data contained 43,146 OTU counts for 2,911 samples measured from 18 different body sites.

We focus here on the statistical exploration and analysis of a repeated measurement design using multivariate projection-based approaches such as Principal Component Analysis (PCA) (data mining and exploration) and sparse Partial Least Squares Discriminant Analysis (sPLS-DA) (for the identification of indicator species characterising each body site). More details about the multilevel approach we use here.


Using the NIH Human Microbial Project 16S data in this example, we assume the data are pre-filtered and offset, as described in mixMC pre-processing steps.


# offset data, filtered
data = diverse.16S$data.raw

We set up the outcome Y which is a factor indicating the body sites of each sample and sample which indicates the unique ID of each unique individual (repeated measures design).

# the outcome
Y = diverse.16S$bodysite
# unique ID of each individual for multilevel analysis
sample = diverse.16S$sample

Unsupervised Analysis with PCA

PCA allows for dimension reduction of the data and visualisation of diversity patterns in microbiome studies. Because we are dealing with compositional data, spurious results can arise if the data are not transformed into an Euclidean space. Therefore, we use CLR transformation , using the argument logratio = 'CLR' in our functions.

We first run a PCA with a sufficiently large number of components ncomp to choose the final reduced dimension of the model.

diverse.pca = pca(data, ncomp = 10, logratio = 'CLR', multilevel = sample)

plot of chunk unnamed-chunk-3

The PCA plot indicates that 2 components are sufficient to explain most of the variation in the data.

You could test different log ratio transformations or use different types of normalisation, multilevel vs non multilevel to see the benefit of taking into account the repeated measures design in the PCA model. Our team favours the CLR transformation.

          comp = c(1,2), # the components to plot
          ind.names = F,
          group = Y,
          legend = TRUE,
          title = 'HMP most diverse, PCA comp 1 - 2')

plot of chunk unnamed-chunk-4

We observe a natural separation between those (very diverse) body sites. In the next steps we run sPLS-DA to identify which OTUs may drive this separation.

Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA

The mixMC frameworks uses the sPLS-DA multivariate analysis from mixOmics [3]. The sPLS-DA selects the most discriminative OTUs that best characterize each body site, see sPLS-DA.

Note that with PLS-DA and sPLS-DA we can only choose a CLR transformation (see details here).

We first run the perf function with a PLS-DA model with no variable selection. Often ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but it depends on the data. We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).

Here we assess the performance of the PLS-DA on 3 components.

diverse.plsda = plsda(X = data, Y, ncomp = nlevels(Y), logratio = 'CLR', multilevel = sample)

diverse.perf.plsda = perf(diverse.plsda, validation = 'Mfold', folds = 5,
                    progressBar = FALSE, nrepeat = 10)

plot(diverse.perf.plsda, overlay = 'measure', sd = TRUE)

plot of chunk unnamed-chunk-5

The plot indicates a decrease in the classification error rate (i.e. an increase in classification performance) from one component to 2 components in the model. The BER stands for Balanced Error Rate and should be considered when we have an unbalanced number of samples per group. Here this is not the case so both overall and BER are overlapping. The performance does not increase after 2 components, which suggest ncomp = 2 for a final PLS-DA model. Note that for the sparse PLS-DA we may obtain a different ncomp.

For more details on the prediction distances, refer to the Supplementary information in Rohart et al. 2017 [4]. The distance ‘max.dist’ seems to give the best performance and will be our distance of choice for sPLS-DA.

plotIndiv(diverse.plsda , comp = c(1,2), ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, PLSDA comp 1 - 2')

plot of chunk unnamed-chunk-6

Here we could check that the third component does not add additional information for the discrimination of the body sites.

plotIndiv(diverse.plsda, comp = c(1,3), ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, PLSDA comp 1 - 3')

plot of chunk unnamed-chunk-7

Tuning sPLS-DA

The parameters to choose in sPLS-DA are the number of variables to select (keepX) and the number of components (ncomp). To do this we use the function tune.splsda. In this example the sPLS-DA tuning step is performed on 3 components, using 5-fold validation repeated 10 times.

The function tune.splsda needs to be performed prior to the sPLS-DA analysis to choose the parameters on a grid of keepX values (make sure you choose the appropriate M-fold cross-validation and provide sufficient nrepeat in the evaluation model, except for 'loo' where it can only be run on 1 repeat), also check the stability of the features selected during the cross-validation process.

# first, set a grid of values to test:
grid.keepX = c(seq(5,150, 5))  
# if you dont understand what this means, type
# grid.keepX  # adjust this grid as necessary for your own data

# this chunk takes ~2 min to run
set.seed(33)  # for reproducible results for this code, remove for your own code
diverse.tune.splsda = tune.splsda(data, Y, 
                                  ncomp = 3,
                                  logratio = 'CLR',
                                  multilevel = sample,
                                  test.keepX = grid.keepX, 
                                  validation = 'Mfold',
                                  folds = 5, dist = 'max.dist', 
                                  nrepeat = 10,
                                  progressBar = FALSE)
# may show some convergence issues for some of the CV-runs, it is ok for tuning
comp1 comp2 comp3
5 0.3333333 0.0246914 0.0425926
10 0.3333333 0.0253086 0.0679012
15 0.3333333 0.0246914 0.0771605
20 0.3333333 0.0228395 0.0895062
25 0.3333333 0.0222222 0.1006173

plot of chunk unnamed-chunk-9

The graphic above shows that the error rate decreases when more components are included in sPLS-DA. The diamonds indicate the optimal keepX variables to select on each component based on the balanced error rate. Our final sPLS-DA model will be run with 2 components as the addition of the third does not seem to improve the classification performance.

# optimal number of variables to select on 2 comps:
select.keepX = diverse.tune.splsda$choice.keepX[1:2]
## comp1 comp2 
##     5    65


We now run a sPLS-DA multilevel analysis. Note: if you have not tuned your sPLS-DA model (or you are unhappy with the size of the selection) you can set the optimal keepX manually, e.g.:

# select.keepX = c(50, 15) # to manually choose size of selection
diverse.splsda = splsda(data, Y, ncomp = 2, logratio = 'CLR', multilevel = sample, keepX = select.keepX)

plotIndiv(diverse.splsda, comp = c(1,2),
          ind.names = FALSE,
          ellipse = TRUE, legend = TRUE,
          title = 'HMP Most Diverse, sPLSDA comp 1 - 2')

plot of chunk unnamed-chunk-12

Do no hesitate to add other components and look at the sample plot to visualise the potential benefit of adding a third component.

Evaluating sPLS-DA

The classification performance of the sPLS-DA multilevel model can be assessed using the function perf. The mean error rates per component are output and type of distance are output. Here, do not hesitate to increase the number of repeats for accurate estimations.

set.seed(34)  # for reproducible results for this code, remove for your own code

diverse.perf.splsda = perf(diverse.splsda, validation = 'Mfold', folds = 5,
                   progressBar = FALSE, nrepeat = 10, dist = 'max.dist')

The mean overall and Balanced Error Rate are output for each component.

## $overall
##         max.dist
## comp1 0.33333333
## comp2 0.01728395
## $BER
##         max.dist
## comp1 0.33333333
## comp2 0.01728395

plot of chunk unnamed-chunk-13

The performance plot confirms that 2 components are sufficient in the sPLS-DA model.

Contribution plots

The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The contribution plots below display the abundance of each OTU (large abundance = large absolute value) and in which body site they are the most abundant for each sPLS-DA component. The contribution plots need to be interpreted in combination with the sample plot below to understand the similarities between body sites, in addition to answer the question “Which bacteria characterise those body sites?”“

The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component. Consider the absolute value as an indication of the importance of the OTU in the microbial signature, while the sign generally indicates positive / negative correlations between the OTUs, relative to the proportions of the others (see also with a plotVar output).

head(selectVar(diverse.splsda, comp = 1)$value)
##               value.var
## OTU_97.38174 -0.6421273
## OTU_97.39439 -0.5842682
## OTU_97.108   -0.3633238
## OTU_97.20    -0.2927394
## OTU_97.39456 -0.1691232

To this list of selected OTU displayed from the most important to the least important we can combine their stability measure from the perf output (i.e. how often were they selected across the different CV runs).

selected.OTU.comp1 = selectVar(diverse.splsda, comp = 1)$name
# stability of OTUs selected on comp 1
kable(data.frame(diverse.perf.splsda$features$stable[[1]][selected.OTU.comp1]), caption = '
Stability of OTUs selected on comp 1, ranked by decreasing importance in sPLS-DA.')
Var1 Freq
OTU_97.38174 1.00
OTU_97.39439 0.96
OTU_97.108 0.96
OTU_97.20 0.86
OTU_97.39456 0.42

The plotLoadings shows that all OTUs selected on the first component are highly abundant in plaque (based on their mean per body site).

plotLoadings(diverse.splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1)

plot of chunk unnamed-chunk-17

The plotLoadings shows that all OTUs selected on the second component below are primarily highly abundant in the skin body site. The sign indicates the opposition between body sites.

plotLoadings(diverse.splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1)

plot of chunk unnamed-chunk-18

We could improve the plot by showing only the top OTUs, and with their names at the Family level, followed by their OTU ID.

name.var = paste(diverse.16S$taxonomy[, 'Family'], rownames(data), sep = '|')

plotLoadings(diverse.splsda, comp = 2, method = 'mean', contrib = 'max', ndisplay = 10, name.var = name.var, size.title = 1, = 0.5, size.legend = 0.5)

plot of chunk unnamed-chunk-19

Clustered Image Map

A heatmap will also help to understand the microbial signature. We represent Clustered Image Maps (with Euclidian distance, Ward linkage set by default) for the OTUs selected on each sPLS-DA component. The abundance values that are displayed are the normalised, log ratio transformed values. All OTUs selected by the sPLS-DA model are displayed, other options can include a specific component, or a specific cutoff of association', see ?cim.

Note, if the margins are too large, use either X11() prior, or use the save and in the cim function.

#CIM plot
cim(diverse.splsda, row.sideColors = color.mixo(Y))

plot of chunk CIM


  1. Filzmoser, P., Hron, K. and Reimann, C., 2009. Principal component analysis for compositional data with outliers. Environmetrics, 20(6), pp.621-632.

  2. Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169

  3. Lê Cao, K.A., Boitard, S. and Besse, P., 2011. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics, 12(1), p.1.

  4. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for 'omics feature selection and multiple data integration.