Human Microbiome Project 16s data
Here we apply the mixMC framework on the Human Microbiome Most Diverse 16S data set, which includes OTU counts on the three most diverse bodysites: Subgingival plaque (Oral), Antecubital fossa (Skin) and Stool sampled from 54 unique individuals for a total of 162 samples. The prefiltered dataset includes 1,674 OTU counts. The data were downloaded from the Human Microbiome Project. The original data contained 43,146 OTU counts for 2,911 samples measured from 18 different body sites.
We focus here on the statistical exploration and analysis of the repeated measurement data set using multivariate projection-based approaches such as Principal Components Analysis (PCA) (data mining and exploration) and sparse Partial Least Squares Discriminant Analysis (sPLS-DA) (for identification of indicator species characterising each body site).
Launch mixOmics, which can be downloaded on our website.
Dont forget to update to the latest mixOmics version.
Data: Most Diverse Data Set
Using the NIH Human Microbial Project 16S data in this example we assume the data are prefiltered and normalised with TSS, as described in mixMC normalisation.
data("diverse.16S") # the 16S normalised data data.mixMC = diverse.16S$data.TSS
We set up the outcome “Y” which is a factor indicating the body sites of each sample and “sample” which indicates the unique ID of each repeated individual (repeated measures design).
# the outcome Y = diverse.16S$bodysite # unique ID of each individual for multilevel analysis sample = diverse.16S$sample
Unsupervised Analsysis with PCA
Using unsupervised PCA allows for dimension reduction of the data and visualisation of diversity patterns in microbiome studies. Because we are using compositional data spurious results can arise if data is not transformed into Euclidean space. Therefore the data must be transformed using either ILR or CLR.
Here we chose the TSS normalisation ILR transformation as advised by Filmozer et al. . The ‘CLR’ log ratio transformation can be used (faster, but slightly different results by changing the parameter to logratio = “CLR”).
pca.res = pca(data.mixMC, ncomp = 10, logratio = 'CLR', multilevel = sample) #pca.res plot(pca.res)
The PCA plot indicates that 2 components are sufficient to explain most of the variation in the data.
- *You can play around with the log ratio transformations or use different normalisation (CSS), and multilevel vs non multilevel to see the benefit of taking into account the repeated measures design in the PCA model.
plotIndiv(pca.res, comp = c(1,2), # the components to plot pch = 16, ind.names = F, group = Y, col.per.group = color.mixo(1:3), legend = TRUE)
We can observe a natural separation between those (very diverse) body sites. Next we hope to find out which OTUs drive this separation.
Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA
The mixMC frameworks uses the sPLS-DA multivariate analysis from mixOmics. The sPLS-DA selects the most discriminative OTUs that best characterize each body site, see sPLS-DA.
Firstly, to choose the number of components for sPLS_DA we run the function perf() for a PLS-DA model with no variable selection (often, ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but depends on the data). We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).
plsda <- plsda(X = data.mixMC, Y, ncomp = nlevels(Y) - 1) perf.plsda <- perf(plsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10) plot(perf.plsda, overlay = 'measure', sd=TRUE)
plotIndiv(plsda , comp = c(1,2), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMB Most Diverse, PLSDA comp 1 - 2')
The parameters to choose in sPLS-DA are the number of variables to select (keepX) and the number of components (ncomp) and to do this we used the function tune.splsda(). In this example the sPLS-DA tuning step is performed on 3 components, using 5-fold validation repeated 10 times.
The tune.splsda() needs to be performed prior to the sPLS-DA analysis to choose the parameters on a grid of keepX values (make sure you choose the appropriate M fold cross-validation and provide sufficient nrepeat = 50 in the evaluation model, except for ‘loo’ where it is run on 1 repeat), also check the stability of the features selected during the cross-validation process.
# this chunk takes ~ 12 min to run splsda.tune = tune.splsda(data.mixMC, Y, ncomp = 3, multilevel = sample, logratio = 'CLR', test.keepX = c(seq(5,150, 5)), validation = 'Mfold', folds = 5, dist = 'max.dist', nrepeat = 10) # may show some convergence issues for some of the cases, it is ok for tuning
# To gain some computing time on the tuning, directly load the data load('RData/mixMC-tune-sPLSDA.RData') kable(head(splsda.tune$error.rate)) plot(splsda.tune)
# optimal number of variables to select on 3 comps: select.keepX = splsda.tune$choice.keepX[1:3] select.keepX
The graphic above shows that the error rate decreases when more components are included in sPLS-DA.
We now run a sPLS-DA multilevel analysis. Note: with sPLS-DA we can only choose a CLR transformation (see details here).
splsda.res <- splsda(data.mixMC, Y, ncomp = 3, keepX = select.keepX) plotIndiv(splsda.res, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'HMB Most Diverse, sPLSDA comp 1 - 2')
The classification performance of the sPLS-DA multilevel model can be assessed using the function perf(). The mean error rates per component are output and type of distance are output.
# setting the seed so we get the same results with participants: set.seed(34) splsda.perf = perf(splsda.res, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10) splsda.perf$error.rate
## $overall ## max.dist centroids.dist mahalanobis.dist ## comp 1 0.3703704 0.2580247 0.2580247 ## comp 2 0.2376543 0.3327160 0.2475309 ## comp 3 0.1969136 0.3154321 0.2061728 ## ## $BER ## max.dist centroids.dist mahalanobis.dist ## comp 1 0.3703704 0.2580247 0.2580247 ## comp 2 0.2376543 0.3327160 0.2475309 ## comp 3 0.1969136 0.3154321 0.2061728
The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The contribution plots below display the abundance of each OTU (large abundance = large absolute value) and in which body site they are the most abundant for each sPLS-DA component. The contribution plots need to be interpreted in combination with the sample plot below to understand the similarities between body sites, in addition to 'which bacteria characterize those body sites?'
The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component:
head(selectVar(splsda.res, comp = 1)$value)
## value.var ## OTU_97.108 -0.6905934 ## OTU_97.38174 -0.5893261 ## OTU_97.38336 -0.3196079 ## OTU_97.33 -0.1723393 ## OTU_97.39439 -0.1370533 ## OTU_97.24573 -0.1216348
par(mfrow=c(1,3)) plotLoadings(splsda.res, comp = 1, method = 'mean', contrib = 'max')
Clustered Image Map
#CIM plot cim(splsda.res, row.sideColors = color.mixo(Y))
Le Cao, K.A., Costello, M.E., Lakis, V.A., Bartolo, F., Chua, X.Y., Brazeilles, R. and Rondeau, P., 2016. mixMC: a multivariate statistical framework to gain insight into Microbial Communities. bioRxiv, p.044206
Le Cao, K.A., Boitard, S. and Besse, P., 2011. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics, 12(1), p.1.