Human Microbiome Project 16s data
Here we apply the mixMC framework on the Human Microbiome Most Diverse 16S data set, which includes OTU counts on the three most diverse bodysites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool sampled from 54 unique individuals for a total of 162 samples. The prefiltered dataset includes 1,674 OTU counts. The data were downloaded from the Human Microbiome Project. The original data contained 43,146 OTU counts for 2,911 samples measured from 18 different body sites.
We focus here on the statistical exploration and analysis of the repeated measurement data set using multivariate projection-based approaches such as Principal Components Analysis (PCA) (data mining and exploration) and sparse Partial Least Squares Discriminant Analysis (sPLS-DA) (for identification of indicator species characterising each body site).
Launch mixOmics, which can be downloaded on our website.
Dont forget to update to the latest mixOmics version.
Data: Most Diverse Data Set
Using the NIH Human Microbial Project 16S data in this example we assume the data are prefiltered and normalised with TSS, as described in mixMC normalisation.
data("diverse.16S") # the 16S normalised data data.mixMC = diverse.16S$data.TSS
We set up the outcome “Y” which is a factor indicating the body sites of each sample and “sample” which indicates the unique ID of each repeated individual (repeated measures design).
# the outcome Y = diverse.16S$bodysite # unique ID of each individual for multilevel analysis sample = diverse.16S$sample
Unsupervised Analsysis with PCA
Using unsupervised PCA allows for dimension reduction of the data and visualisation of diversity patterns in microbiome studies. Because we are using compositional data spurious results can arise if data is not transformed into Euclidean space. Therefore the data must be transformed using either ILR or CLR.
Here we chose the TSS normalisation ILR transformation as advised by Filmozer et al . We prefer the ‘CLR’ log ratio transformation cas it is faster and can be used all throughout our framework (from PCA to sPLSDA), using the argument logratio = “CLR”).
pca.res = pca(data.mixMC, ncomp = 10, logratio = 'CLR', multilevel = sample) #pca.res plot(pca.res)
The PCA plot indicates that 2 components are sufficient to explain most of the variation in the data.
You can test different log ratio transformations or use different normalisation (CSS), and multilevel vs non multilevel to see the benefit of taking into account the repeated measures design in the PCA model. Our team favours the normalisation TSS followed by CLR transformation.
plotIndiv(pca.res, comp = c(1,2), # the components to plot pch = 16, ind.names = F, group = Y, col.per.group = color.mixo(1:3), legend = TRUE)
We can observe a natural separation between those (very diverse) body sites. Next we run sPLSDA to identify which OTUs may drive this separation.
Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA
The mixMC frameworks uses the sPLS-DA multivariate analysis from mixOmics. The sPLS-DA selects the most discriminative OTUs that best characterize each body site, see sPLS-DA.
Firstly, to choose the number of components for sPLS-DA we run the function perf() for a PLS-DA model with no variable selection (often, ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but depends on the data). We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).
Here we assess the performance of the PLSDA on 3 components.
plsda <- plsda(X = data.mixMC, Y, ncomp = nlevels(Y), logratio = 'CLR', multilevel = sample) perf.plsda <- perf(plsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10) plot(perf.plsda, overlay = 'measure', sd = TRUE)
The plot indicates a decrease in the classification error rate (i.e. an increase in classification performance) from one component to 2 components in the model. The BER stands for Balanced Error Rate and should be considered when we have an unbalanced number of samples per group. Here this is not the case so both overall and BER are overlapping. We can observe that the performance does not increase after 2 components, which suggest ncomp = 2 for a final PLSDA model. Note that for the sparse PLSDA we may obtain a different ncomp.
For more details on the prediction distances, refer to Suppl info in Rohart et al. 2017 . The distance 'max.dist' seem to lead to the best performance and will be kept for sPLSDA.
plotIndiv(plsda , comp = c(1,2), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, PLSDA comp 1 - 2')
Here we could check that the third component does not add additional information for the discrimination of the body sites.
plotIndiv(plsda , comp = c(1,3), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, PLSDA comp 1 - 3')
The parameters to choose in sPLS-DA are the number of variables to select (keepX) and the number of components (ncomp) and to do this we used the function tune.splsda(). In this example the sPLS-DA tuning step is performed on 3 components, using 5-fold validation repeated 10 times.
The tune.splsda() needs to be performed prior to the sPLS-DA analysis to choose the parameters on a grid of keepX values (make sure you choose the appropriate M fold cross-validation and provide sufficient nrepeat = 50 in the evaluation model, except for ‘loo’ where it is run on 1 repeat), also check the stability of the features selected during the cross-validation process.
# this chunk takes ~ 12 min to run splsda.tune = tune.splsda(data.mixMC, Y, ncomp = 3, logratio = 'CLR', multilevel = sample, test.keepX = c(seq(5,150, 5)), validation = 'Mfold', folds = 5, dist = 'max.dist', nrepeat = 10) # may show some convergence issues for some of the cases, it is ok for tuning
# To gain some computing time on the tuning, directly load the data load('RData/mixMC-tune-sPLSDA.RData') kable(head(splsda.tune$error.rate)) plot(splsda.tune)
# optimal number of variables to select on 3 comps: select.keepX = splsda.tune$choice.keepX[1:3] select.keepX
The graphic above shows that the error rate decreases when more components are included in sPLS-DA. The diamonds indicate the optimal keepX variables to select on each component based on the balanced error rate. The addtion of the third component seem to improve the perfomance and so our final sPLS-DA model will be run for 3 components.
We now run a sPLS-DA multilevel analysis. Note: with sPLS-DA we can only choose a CLR transformation (see details here.
splsda.res <- splsda(data.mixMC, Y, ncomp = 3, logratio = 'CLR', multilevel = sample, keepX = select.keepX) plotIndiv(splsda.res, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, sPLSDA comp 1 - 2')
When we look however at the addition of the third component, it does not seem to discriminate the groups further. Hence we can focus primarily on the OTUs selected on the first 2 components.
plotIndiv(splsda.res, comp = c(1,3), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMP Most Diverse, sPLSDA comp 1 - 3')
The classification performance of the sPLS-DA multilevel model can be assessed using the function perf(). The mean error rates per component are output and type of distance are output. Here do not hesitate to increase the number of repeats for accurate estimations.
# setting the seed so we get the same results with participants: set.seed(34) splsda.perf = perf(splsda.res, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10, dist = 'max.dist') splsda.perf$error.rate
## $overall ## max.dist ## comp 1 0.33333333 ## comp 2 0.01358025 ## comp 3 0.01543210 ## ## $BER ## max.dist ## comp 1 0.33333333 ## comp 2 0.01358025 ## comp 3 0.01543210
The performance plot confirms that 2 components are sufficient in the sPLS-DA model
The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The contribution plots below display the abundance of each OTU (large abundance = large absolute value) and in which body site they are the most abundant for each sPLS-DA component. The contribution plots need to be interpreted in combination with the sample plot below to understand the similarities between body sites, in addition to 'which bacteria characterize those body sites?'
The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component. Consider the absolute value as an indication of the importance of the OTU in the microbial signature, while the sign generally indicates positive / negative correlations between the OTUs TSS-CLR counts (see also with a plotVar).
head(selectVar(splsda.res, comp = 1)$value)
## value.var ## OTU_97.38174 -0.5637904 ## OTU_97.39439 -0.5019205 ## OTU_97.108 -0.3956522 ## OTU_97.20 -0.3527520 ## OTU_97.39456 -0.2507409 ## OTU_97.55 -0.1754006
The plotLoadings shows that all OTUs selected on the first component are highly abundant in plaque.
plotLoadings(splsda.res, comp = 1, method = 'mean', contrib = 'max', size.title = 1.0)
The plotLoadings shows that all OTUs selected on the second component are highly abundant in the two other bodysites. The sign indicate the opposition between the body sites.
plotLoadings(splsda.res, comp = 2, method = 'mean', contrib = 'max', size.title = 1.0)