Non-Repeated Measures (Koren)

Koren Data 16s

Here we apply mixMC to the 16S study of Koren et al. (2011) who examined the link between oral, gut and plaque microbial communities in patients with atherosclerosis and controls. Only healthy individuals were retained in the analysis. This study contained partially repeated measures from multiple sites including 15 unique patients samples from saliva and stool, and 13 unique patients only sampled from arterial plaque samples and we therefore considered a non multilevel analysis for that experimental design (i.e. no repeated measurements). The data were downloaded from the Qiita database. After pre-filtering, the data included 973 OTU for 43 samples. Data were TSS normalized and log ratio transformations are applied in PCA and PLS-DA.

Load the latest version of mixOmics.

library(mixOmics)

Data: Koren Data Set

Using the Koren data in this example we assume the data are prefiltered and normalised with TSS, as described in mixMC normalisation.

# normalised data
data("Koren.16S")
#ls(Koren.16S)
koren.TSS= Koren.16S$data.TSS

We set up the outcome “Y” as a factor indicating the body sites of each sample. A repeated measures design is not nessessary as each sample is not repeated more then once.

Y  =  Koren.16S$indiv$body_site
summary(Y)

Unsupervised Analsysis with Principal Component Analysis

PCA enables visualisation of diversity patterns in an unsupervised analysis. We need to specify the log ratio transformation (choose between ‘CLR’ or ‘ILR’) and specify a sufficient number of components to understand how the variation in the data is explained per component.

pca.res = pca(koren.TSS, ncomp = 10, logratio = 'CLR')
plot(pca.res)

plot of chunk pca

The barplot above depicts the percentage of explained variance per component for a PCA with 10 components. It is a good indicator of how many components to retain in the final PCA model (here two components will be sufficient).

The sample plot below displays the samples colored according to Y, but note that PCA does not take Y into account in the analysis!

#plot first component
plotIndiv(pca.res,  comp = c(1,2),
          pch = 16, ind.names = F, group = Y, col.per.group = color.mixo(1:3),
          legend = TRUE)

plot of chunk plotIndiv PCA

Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA

sPLS-DA performs variable (OTU) selection on each component, see sPLS-DA for details.

Firstly, to choose the number of components for sPLS_DA we run the function perf() for a PLS-DA model with no variable selection (often, ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but depends on the data). We choose 5-fold cross-validation repeated 10 times. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100).

koren.plsda <- plsda(X = koren.TSS, Y, ncomp = nlevels(Y) - 1)

koren.perf.plsda <- perf(koren.plsda, validation = 'Mfold', folds = 5,
                    progressBar = FALSE, nrepeat = 10)

plot(koren.perf.plsda, overlay = 'measure', sd=TRUE)

plot of chunk plsda

Above is the plot of the classification error rate averaged across 5 folds and the 10 repeated CV for all prediction distances. BER stands for balanced error rate, which accounts for unbalanced number of samples per class. This step allows to choose the best prediction distance that will be input in the tune sPLS-DA step below.

#First two components
plotIndiv(koren.plsda , comp = c(1,2),
          group = Y, ind.names = FALSE, 
          ellipse = TRUE, legend = TRUE, title = 'KOREN-16S, PLSDA comp 1 - 2')

plot of chunk plot plsda

Tuning sPLS-DA

Secondly, we tune keepX in sPLSDA with the chosen number of components + 1 (to double check our choice from the step above) using tune.splsda. We choose 5-fold cross-validation repeated 10 times in this example. To ensure a stable result we advise to set nrepeat = 50-100)

# this chunk takes ~ 12 min to run
splsda.tune = tune.splsda(koren.TSS, 
                          Y = Y, 
                          ncomp = 3, 
                          multilevel = NULL, 
                          logratio = 'CLR',
                          test.keepX = c(seq(5,150, 5)), 
                          validation = c('Mfold'), 
                          folds = 5, 
                          dist = 'max.dist', # prediction distance can be chosen according to tune.plsda results
                          nrepeat = 10)
# may show some convergence issues for some of the cases, it is ok for tuning

Our tuning indicate the following optimal keepX to select on each component:

splsda.tune$choice.keepX
## comp1 comp2 comp3 
##   150    15   145

The following command line will output the error rate according to the number of variable selected:

kable(head(splsda.tune$error.rate))
comp1 comp2 comp3
5 0.3933333 0.0200000 0.0936752
10 0.3688889 0.0088889 0.0914530
15 0.3577778 0.0000000 0.0981197
20 0.3488889 0.0000000 0.0892308
25 0.3377778 0.0000000 0.0870085
30 0.3345299 0.0000000 0.0825641
# The plot will show the average error rate with respect to the keepX values tested:
plot(splsda.tune)

plot of chunk unnamed-chunk-4

sPLS-DA

We now run a sPLS-DA non-multilevel analysis. Note: with sPLS-DA we can only choose a CLR transformation (see details in our manuscript here)).

# input parameters for sPLS-DA
# if you have not tuned the model,  you can set the optimal keepX to our results:

choice.keepX = c(150, 15, 145) # optimal keepX values according to the tuning criterion above.

choice.ncomp = length(choice.keepX) # the number of components


# the sPLS-DA
res.splsda = splsda(X = koren.TSS, 
                    Y = Y,
                    ncomp = choice.ncomp,
                    keepX = choice.keepX,
                    logratio= "CLR")

sPLS-DA plots

The sample plot below shows the sPLS-DA first two components. The ellipse are 0.95 confidence interval ellipse for each body site.

#for component 1 & 2
plotIndiv(res.splsda, 
          ind.names = F, 
          col.per.group = color.mixo(1:3), 
          comp = c(1,2),
          pch = 16, 
          ellipse = TRUE,
          legend = TRUE)

plot of chunk splsda-plot

The code below outputs the first selected OTUs and their coefficient (from the loading vector) on the first component:

head(selectVar(res.splsda, comp = 1)$value)  # just a head
##         value.var
## 1855954 0.2316218
## 4325533 0.2079620
## 4403632 0.1831214
## 752354  0.1802133
## 363064  0.1735628
## 323473  0.1733461

Evaluating sPLS-DA

The classification performance of the final sPLS-DA model can be assessed using the function perf(). The mean error rates per component and per prediction distance are output.The prediction distance can also be specified, see ?perf.

splsda.perf = perf(res.splsda, validation = 'Mfold', folds = 5, 
                   progressBar = FALSE, nrepeat = 10)

splsda.perf$error.rate
## $overall
##           max.dist centroids.dist mahalanobis.dist
## comp 1 0.306976744    0.211627907       0.21162791
## comp 2 0.006976744    0.006976744       0.01395349
## comp 3 0.065116279    0.065116279       0.06279070
## 
## $BER
##           max.dist centroids.dist mahalanobis.dist
## comp 1 0.325470085    0.222735043       0.22273504
## comp 2 0.006666667    0.006666667       0.01333333
## comp 3 0.062222222    0.062222222       0.06000000
head(splsda.perf$error.rate.class)
## $max.dist
##                       comp 1 comp 2 comp 3
## UBERON:artery wall 0.7692308      0    0.0
## UBERON:feces       0.0000000      0    0.2
## UBERON:mouth       0.2000000      0    0.0
## 
## $centroids.dist
##                       comp 1     comp 2 comp 3
## UBERON:artery wall 0.3846154 0.00000000    0.0
## UBERON:feces       0.0000000 0.06666667    0.2
## UBERON:mouth       0.2666667 0.00000000    0.0
## 
## $mahalanobis.dist
##                       comp 1 comp 2    comp 3
## UBERON:artery wall 0.3846154      0 0.0000000
## UBERON:feces       0.0000000      0 0.2666667
## UBERON:mouth       0.2666667      0 0.0000000
plot(splsda.perf)

plot of chunk unnamed-chunk-8

Clustered Image maps

We represent clustered image maps (with Euclidian distance, Ward linkage set by default) for the OTU selected on each sPLS-DA component. The abundance values that are displayed are the normalised, log ratio transformed values. All OTUs selected by the sPLS-DA model are displayed, other options can include a specific component, see ?cim

cim(res.splsda, comp = 1, row.sideColors = color.mixo(Y))

plot of chunk unnamed-chunk-10

Contribution plots

The sPLS-DA selects the most discriminative OTUs that best characterize each body site indicated in Y. The contribution plots below displays the importance of each OTU in the sPLS-DA model and in which body site they are the most abundant (contrib = 'max'), according to the median (method = 'median'). Other options are available, see ?plotLoadings.

#for component 1
plotLoadings(res.splsda, comp = 1, method = 'mean', contrib = 'max')

plot of chunk unnamed-chunk-11

References

  1. Koren, O., Knights, D., Gonzalez, A., Waldron, L., Segata, N., Knight, R., Huttenhower, C. and Ley, R.E., 2013. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol, 9(1), p.e1002863

  2. Lê Cao, K.-A., Boitard, S., Besse, P.: Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics 12(1), 253 (2011)

  3. Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169. doi: 10.1371/journal.pone.0160169

  4. Asnicar, F., Weingart, G., Tickle, T.L., Huttenhower, C. and Segata, N., 2015. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ, 3, p.e1029.