Koren Data 16s
Here we apply mixMC to the 16S study of Koren et al. (2011) who examined the link between oral, gut and plaque microbial communities in patients with atherosclerosis and controls. Only healthy individuals were retained in the analysis. This study contained partially repeated measures from multiple sites including 15 unique patients samples from saliva and stool, and 13 unique patients only sampled from arterial plaque samples and we therefore considered a non multilevel analysis for that experimental design (i.e. no repeated measurements). The data were downloaded from the Qiita database. After pre-filtering, the data included 973 OTU for 43 samples. Data were TSS normalized and log ratio transformations are applied in PCA and PLS-DA.
Load the latest version of mixOmics.
Data: Koren Data Set
Using the Koren data in this example we assume the data are prefiltered and normalised with TSS, as described in mixMC normalisation.
# normalised data data("Koren.16S") #ls(Koren.16S) koren.TSS= Koren.16S$data.TSS
We set up the outcome “Y” as a factor indicating the body sites of each sample. A repeated measures design is not nessessary as each sample is not repeated more then once.
Y = Koren.16S$indiv$body_site summary(Y)
Unsupervised Analsysis with Principal Component Analysis
PCA enables visualisation of diversity patterns in an unsupervised analysis. We need to specify the log ratio transformation (choose between ‘CLR’ or ‘ILR’) and specify a sufficient number of components to understand how the variation in the data is explained per component.
pca.res = pca(koren.TSS, ncomp = 10, logratio = 'CLR') plot(pca.res)
The barplot above depicts the percentage of explained variance per component for a PCA with 10 components. It is a good indicator of how many components to retain in the final PCA model (here two components will be sufficient).
The sample plot below displays the samples colored according to Y, but note that PCA does not take Y into account in the analysis!
#plot first component plotIndiv(pca.res, comp = c(1,2), pch = 16, ind.names = F, group = Y, col.per.group = color.mixo(1:3), legend = TRUE)
Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA
sPLS-DA performs variable (OTU) selection on each component, see sPLS-DA for details.
Firstly, to choose the number of components for sPLS_DA we run the function perf() for a PLS-DA model with no variable selection (often, ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but depends on the data). We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).
koren.plsda <- plsda(X = koren.TSS, Y, ncomp = nlevels(Y) - 1) koren.perf.plsda <- perf(koren.plsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10) plot(koren.perf.plsda, overlay = 'measure', sd=TRUE)
Above is the plot of the classification error rate averaged across 5 folds and the 10 repeated CV for all prediction distances. BER stands for balanced error rate, which accounts for unbalanced number of samples per class. This step allows to choose the best prediction distance that will be input in the tune sPLS-DA step below.
#First two components plotIndiv(koren.plsda , comp = c(1,2), group = Y, ind.names = FALSE, ellipse = TRUE, legend = TRUE, title = 'KOREN-16S, PLSDA comp 1 - 2')
Secondly, we tune keepX in sPLSDA with the chosen number of components + 1 (to double check our choice from the step above) using tune.splsda. We choose 5-fold cross-validation repeated 10 times in this example. To ensure a stable result we advise to set nrepeat = 50-100)
# this chunk takes ~ 12 min to run splsda.tune = tune.splsda(koren.TSS, Y = Y, ncomp = 3, multilevel = NULL, logratio = 'CLR', test.keepX = c(seq(5,150, 5)), validation = c('Mfold'), folds = 5, dist = 'max.dist', # prediction distance can be chosen according to tune.plsda results nrepeat = 10) # may show some convergence issues for some of the cases, it is ok for tuning
Our tuning indicate the following optimal keepX to select on each component:
## comp1 comp2 comp3 ## 150 15 145
The following command line will output the error rate according to the number of variable selected:
# The plot will show the average error rate with respect to the keepX values tested: plot(splsda.tune)
We now run a sPLS-DA non-multilevel analysis. Note: with sPLS-DA we can only choose a CLR transformation (see details in our manuscript here)).
# input parameters for sPLS-DA # if you have not tuned the model, you can set the optimal keepX to our results: choice.keepX = c(150, 15, 145) # optimal keepX values according to the tuning criterion above. choice.ncomp = length(choice.keepX) # the number of components # the sPLS-DA res.splsda = splsda(X = koren.TSS, Y = Y, ncomp = choice.ncomp, keepX = choice.keepX, logratio= "CLR")
The sample plot below shows the sPLS-DA first two components. The ellipse are 0.95 confidence interval ellipse for each body site.
#for component 1 & 2 plotIndiv(res.splsda, ind.names = F, col.per.group = color.mixo(1:3), comp = c(1,2), pch = 16, ellipse = TRUE, legend = TRUE)
The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component:
head(selectVar(res.splsda, comp = 1)$value) # just a head
## value.var ## 1855954 0.2316218 ## 4325533 0.2079620 ## 4403632 0.1831214 ## 752354 0.1802133 ## 363064 0.1735628 ## 323473 0.1733461
The classification performance of the final sPLS-DA model can be assessed using the function perf(). The mean error rates per component and per prediction distance are output.The prediction distance can also be specified, see ?perf.
splsda.perf = perf(res.splsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10) splsda.perf$error.rate
## $overall ## max.dist centroids.dist mahalanobis.dist ## comp 1 0.306976744 0.211627907 0.21162791 ## comp 2 0.006976744 0.006976744 0.01395349 ## comp 3 0.065116279 0.065116279 0.06279070 ## ## $BER ## max.dist centroids.dist mahalanobis.dist ## comp 1 0.325470085 0.222735043 0.22273504 ## comp 2 0.006666667 0.006666667 0.01333333 ## comp 3 0.062222222 0.062222222 0.06000000
## $max.dist ## comp 1 comp 2 comp 3 ## UBERON:artery wall 0.7692308 0 0.0 ## UBERON:feces 0.0000000 0 0.2 ## UBERON:mouth 0.2000000 0 0.0 ## ## $centroids.dist ## comp 1 comp 2 comp 3 ## UBERON:artery wall 0.3846154 0.00000000 0.0 ## UBERON:feces 0.0000000 0.06666667 0.2 ## UBERON:mouth 0.2666667 0.00000000 0.0 ## ## $mahalanobis.dist ## comp 1 comp 2 comp 3 ## UBERON:artery wall 0.3846154 0 0.0000000 ## UBERON:feces 0.0000000 0 0.2666667 ## UBERON:mouth 0.2666667 0 0.0000000
Clustered Image maps
We represent clustered image maps (with Euclidian distance, Ward linkage set by default) for the OTU selected on each sPLS-DA component. The abundance values that are displayed are the normalised, log ratio transformed values. All OTUs selected by the sPLS-DA model are displayed, other options can include a specific component, see ?cim
cim(res.splsda, comp = 1, row.sideColors = color.mixo(Y))
The sPLS-DA selects the most discriminative OTUs that best characterize each body site indicated in Y. The contribution plots below displays the importance of each OTU in the sPLS-DA model and in which body site they are the most abundant (contrib = 'max'), according to the median (method = 'median'). Other options are available, see ?plotLoadings.
#for component 1 plotLoadings(res.splsda, comp = 1, method = 'mean', contrib = 'max')
Koren, O., Knights, D., Gonzalez, A., Waldron, L., Segata, N., Knight, R., Huttenhower, C. and Ley, R.E., 2013. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol, 9(1), p.e1002863
Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169