Note: this case study is quite detailed for illustrative purposes. You can skip some parts for your analyses. For more details about the methods and performance measures we use, have a look at our list of references, and any of the other tabs focusing on specific methods.
Here we analyse a subset of the Human Microbiome Most Diverse 16S data set, which includes OTU counts on the three most diverse body sites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool, sampled from 54 unique individuals for a total of 162 samples. The pre-filtered data set includes 1,674 OTU counts. The data were downloaded from the Human Microbiome Project. The original data contained 43,146 OTU counts for 2,911 samples measured from 18 different body sites.
We focus here on the statistical exploration and analysis of a repeated measurement design using multivariate projection-based approaches such as Principal Component Analysis (PCA) (data mining and exploration) and sparse Partial Least Squares Discriminant Analysis (sPLS-DA) (for the identification of indicator species characterising each body site). More details about the multilevel approach we use here.
Using the NIH Human Microbial Project 16S data in this example, we assume the data are pre-filtered and offset, as described in mixMC pre-processing steps.
library(mixOmics) data("diverse.16S") # offset data, filtered data = diverse.16S$data.raw
We set up the outcome Y which is a factor indicating the body sites of each sample and sample which indicates the unique ID of each unique individual (repeated measures design).
# the outcome Y = diverse.16S$bodysite # unique ID of each individual for multilevel analysis sample = diverse.16S$sample
Unsupervised Analysis with PCA
PCA allows for dimension reduction of the data and visualisation of diversity patterns in microbiome studies. Because we are dealing with compositional data, spurious results can arise if the data are not transformed into an Euclidean space. Therefore, we use CLR transformation , using the argument
logratio = 'CLR' in our functions.
We first run a PCA with a sufficiently large number of components
ncomp to choose the final reduced dimension of the model.
diverse.pca = pca(data, ncomp = 10, logratio = 'CLR', multilevel = sample) #diverse.pca plot(diverse.pca)
The PCA plot indicates that 2 components are sufficient to explain most of the variation in the data.
You could test different log ratio transformations or use different types of normalisation, multilevel vs non multilevel to see the benefit of taking into account the repeated measures design in the PCA model. Our team favours the CLR transformation.
plotIndiv(diverse.pca, comp = c(1,2), # the components to plot ind.names = F, group = Y, legend = TRUE, title = 'HMP most diverse, PCA comp 1 - 2')
We observe a natural separation between those (very diverse) body sites. In the next steps we run sPLS-DA to identify which OTUs may drive this separation.
Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA
The mixMC frameworks uses the sPLS-DA multivariate analysis from mixOmics . The sPLS-DA selects the most discriminative OTUs that best characterize each body site, see sPLS-DA.
Note that with PLS-DA and sPLS-DA we can only choose a CLR transformation (see details here).
We first run the
perf function with a PLS-DA model with no variable selection. Often
ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but it depends on the data. We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).
Here we assess the performance of the PLS-DA on 3 components.
diverse.plsda = plsda(X = data, Y, ncomp = nlevels(Y), logratio = 'CLR', multilevel = sample) diverse.perf.plsda = perf(diverse.plsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10) plot(diverse.perf.plsda, overlay = 'measure', sd = TRUE)
The plot indicates a decrease in the classification error rate (i.e. an increase in classification performance) from one component to 2 components in the model. The BER stands for Balanced Error Rate and should be considered when we have an unbalanced number of samples per group. Here this is not the case so both overall and BER are overlapping. The performance does not increase after 2 components, which suggest
ncomp = 2 for a final PLS-DA model. Note that for the sparse PLS-DA we may obtain a different
For more details on the prediction distances, refer to the Supplementary information in Rohart et al. 2017 . The distance ‘max.dist’ seems to give the best performance and will be our distance of choice for sPLS-DA.
plotIndiv(diverse.plsda , comp = c(1,2), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, PLSDA comp 1 - 2')
Here we could check that the third component does not add additional information for the discrimination of the body sites.
plotIndiv(diverse.plsda, comp = c(1,3), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, PLSDA comp 1 - 3')
The parameters to choose in sPLS-DA are the number of variables to select (
keepX) and the number of components (
ncomp). To do this we use the function
tune.splsda. In this example the sPLS-DA tuning step is performed on 3 components, using 5-fold validation repeated 10 times.
tune.splsda needs to be performed prior to the sPLS-DA analysis to choose the parameters on a grid of
keepX values (make sure you choose the appropriate M-fold cross-validation and provide sufficient
nrepeat in the evaluation model, except for
'loo' where it can only be run on 1 repeat), also check the stability of the features selected during the cross-validation process.
# first, set a grid of values to test: grid.keepX = c(seq(5,150, 5)) # if you dont understand what this means, type # grid.keepX # adjust this grid as necessary for your own data # this chunk takes ~2 min to run set.seed(33) # for reproducible results for this code, remove for your own code diverse.tune.splsda = tune.splsda(data, Y, ncomp = 3, logratio = 'CLR', multilevel = sample, test.keepX = grid.keepX, validation = 'Mfold', folds = 5, dist = 'max.dist', nrepeat = 10, progressBar = FALSE) # may show some convergence issues for some of the CV-runs, it is ok for tuning
The graphic above shows that the error rate decreases when more components are included in sPLS-DA. The diamonds indicate the optimal
keepX variables to select on each component based on the balanced error rate. Our final sPLS-DA model will be run with 2 components as the addition of the third does not seem to improve the classification performance.
# optimal number of variables to select on 2 comps: select.keepX = diverse.tune.splsda$choice.keepX[1:2] select.keepX
## comp1 comp2 ## 5 65
We now run a sPLS-DA multilevel analysis. Note: if you have not tuned your sPLS-DA model (or you are unhappy with the size of the selection) you can set the optimal
keepX manually, e.g.:
# select.keepX = c(50, 15) # to manually choose size of selection
diverse.splsda = splsda(data, Y, ncomp = 2, logratio = 'CLR', multilevel = sample, keepX = select.keepX) plotIndiv(diverse.splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, sPLSDA comp 1 - 2')
Do no hesitate to add other components and look at the sample plot to visualise the potential benefit of adding a third component.
The classification performance of the sPLS-DA multilevel model can be assessed using the function
perf. The mean error rates per component are output and type of distance are output. Here, do not hesitate to increase the number of repeats for accurate estimations.
set.seed(34) # for reproducible results for this code, remove for your own code diverse.perf.splsda = perf(diverse.splsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10, dist = 'max.dist')
The mean overall and Balanced Error Rate are output for each component.
## $overall ## max.dist ## comp1 0.33333333 ## comp2 0.01728395 ## ## $BER ## max.dist ## comp1 0.33333333 ## comp2 0.01728395
The performance plot confirms that 2 components are sufficient in the sPLS-DA model.
The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The contribution plots below display the abundance of each OTU (large abundance = large absolute value) and in which body site they are the most abundant for each sPLS-DA component. The contribution plots need to be interpreted in combination with the sample plot below to understand the similarities between body sites, in addition to answer the question “Which bacteria characterise those body sites?”“
The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component. Consider the absolute value as an indication of the importance of the OTU in the microbial signature, while the sign generally indicates positive / negative correlations between the OTUs, relative to the proportions of the others (see also with a
head(selectVar(diverse.splsda, comp = 1)$value)
## value.var ## OTU_97.38174 -0.6421273 ## OTU_97.39439 -0.5842682 ## OTU_97.108 -0.3633238 ## OTU_97.20 -0.2927394 ## OTU_97.39456 -0.1691232
To this list of selected OTU displayed from the most important to the least important we can combine their stability measure from the
perf output (i.e. how often were they selected across the different CV runs).
selected.OTU.comp1 = selectVar(diverse.splsda, comp = 1)$name # stability of OTUs selected on comp 1 diverse.perf.splsda$features$stable[][selected.OTU.comp1]
kable(data.frame(diverse.perf.splsda$features$stable[][selected.OTU.comp1]), caption = ' Stability of OTUs selected on comp 1, ranked by decreasing importance in sPLS-DA.')
plotLoadings shows that all OTUs selected on the first component are highly abundant in plaque (based on their mean per body site).
plotLoadings(diverse.splsda, comp = 1, method = 'mean', contrib = 'max', size.title = 1)
plotLoadings shows that all OTUs selected on the second component below are primarily highly abundant in the skin body site. The sign indicates the opposition between body sites.
plotLoadings(diverse.splsda, comp = 2, method = 'mean', contrib = 'max', size.title = 1)
We could improve the plot by showing only the top OTUs, and with their names at the Family level, followed by their OTU ID.
name.var = paste(diverse.16S$taxonomy[, 'Family'], rownames(data), sep = '|') plotLoadings(diverse.splsda, comp = 2, method = 'mean', contrib = 'max', ndisplay = 10, name.var = name.var, size.title = 1, size.name = 0.5, size.legend = 0.5)
Clustered Image Map
A heatmap will also help to understand the microbial signature. We represent Clustered Image Maps (with Euclidian distance, Ward linkage set by default) for the OTUs selected on each sPLS-DA component. The abundance values that are displayed are the normalised, log ratio transformed values. All OTUs selected by the sPLS-DA model are displayed, other options can include a specific component, or a specific cutoff of association', see
Note, if the margins are too large, use either
X11() prior, or use the
name.save in the
#CIM plot cim(diverse.splsda, row.sideColors = color.mixo(Y))
Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169
Lê Cao, K.A., Boitard, S. and Besse, P., 2011. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics, 12(1), p.1.