HMP Oral Data 16S

Here we apply mixMC to the Human Microbiome Oral 16S data set. In this data set we considered samples from oral cavity, which has been found to be as diverse as the stool microbiome. The nine oral sites were Attached Keratinising Gingiva, Buccal Mucosa, Hard Palate, Palatine Tonsils, Saliva, Subgingival Plaque, Supragingival Plaque, Throat and Tongue Dorsum. After prefiltering, the data include 1,562 OTU for 73 unique patients and a total of 657 samples.

Dont forget to update to the latest mixOmics version

library(mixOmics)


Data: Oral Data Set

The Oral data set used in this example can be dowloaded from the mixMC page.

load('Data/HMP-Oral.RData')


Filtering and Normalisation

Using the Oral 16S data in this example we assume the data are prefiltered and normalised with TSS, see Filtering and Normalisation for more details.

Set up the outcome “Y” as a factor indicating the body sites of each sample and “sample” which indicates the unique ID of each repeated individual (repeated measures design).

# input the Y outcome, which is a factor indicating the body sites of each sample while sample a vector indicating the unique ID of each repeated individual
Y  =  as.factor(indiv$HMPbodysubsite) summary(Y)  ## Attached_Keratinized_gingiva Buccal_mucosa ## 73 73 ## Hard_palate Palatine_Tonsils ## 73 73 ## Saliva Subgingival_plaque ## 73 73 ## Supragingival_plaque Throat ## 73 73 ## Tongue_dorsum ## 73  sample = as.factor(indiv$RSID)


Unsupervised Analsysis with PCA

PCA enables visualisation of diversity patterns in an unsupervised analysis. We need to specify the log ratio transformation (choose between ‘CLR’ or ‘ILR’) and specify a sufficient number of components to understand how the variation in the data is explained per component.

pca.res = pca(X = data.TSS, ncomp = 10, logratio = 'ILR', multilevel = sample)
plot(pca.res)


The barplot above depicts the percentage of explained variance per component for a PCA. It is a good indicator of how many components to retain in the final PCA model. The sample plot below displays the samples colored according to Y, but note that PCA does not take Y into account in the analysis!

# plotting the first components
plotIndiv(pca.res,  comp = c(1,2),
pch = 16, ind.names = F, group = Y, col.per.group = color.mixo(1:9),
legend = TRUE
)


Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA

sPLS-DA performs variable (OTU) on each component, see sPLS-DA for details.

Tuning sPLS-DA

To choose the number of components for sPLS_DA we run the function perf() for a PLS-DA model with no variable selection (often, ncomp = K-1 where K is the number of categories in the outcome Y is sufficient, but depends on the data).

#plsda
data.plsda <- plsda(X = data.TSS, Y, ncomp = nlevels(Y) - 1)

# check performance
perf.plsda <- perf(data.plsda, validation = 'Mfold', folds = 5,
progressBar = FALSE, nrepeat = 10)

plot(perf.plsda, overlay = 'measure', sd=TRUE)


Above is the plot of the classification error rate averaged across 5 folds and the 10 repeated CV for all prediction distances. BER stands for balanced error rate, which accounts for unbalanced number of samples per class. This step allows to choose the best prediction distance that will be input in the tune sPLS-DA step below.

plotIndiv(data.plsda , comp = c(1,2),
group = Y, ind.names = FALSE,
ellipse = TRUE, legend = TRUE, title = 'Oral-16S, PLSDA comp 1 - 2')


We can observe better clusters than previously with the PCA analysis. This is to be expected since the PLS-DA model includes the class information of each individual. The sPLS-DA analysis will help refine the sample clusters and select a small subset of variables relevant to discriminate each class.

Tuning sPLS-DA

The tuning of sPLS-DA is being performed one component at a time inside the function and the optimal number of variables to select is automatically retrieved for each component. We set ncomp = 5 and we used 10-fold cross validation (folds = 5 repeated 10 times). To ensure a stable result we advise to set nrepeat = 50-100.

# set the tuning grid for some possible values of keepX
list.keepX<- c(5:10, seq(15, 50, 5), seq(60, 100, 10))
ncomp = 5
splsda.tune = tune.splsda(data.TSS, Y,
ncomp = 5, # (ncomp + 1)
test.keepX = list.keepX,
multilevel = sample,
logratio = 'CLR',
validation = 'Mfold',
folds = 5, dist = "centroids.dist", nrepeat = 10) #recomended between 50-100
# may show some convergence issues for some of the cases, it is ok for tuning

# mean error rate across all CV folds and nrepeats
head(splsda.tune$error.rate)  ## comp1 comp2 comp3 comp4 comp5 ## 1 0.7745814 0.5589041 0.5185693 0.5030441 0.4841705 ## 2 0.7646880 0.5414003 0.5210046 0.5025875 0.4799087 ## 3 0.7579909 0.5418569 0.5234399 0.5044140 0.4768645 ## 4 0.7534247 0.5423135 0.5249619 0.5009132 0.4742770 ## 5 0.7462709 0.5414003 0.5267884 0.5007610 0.4722983 ## 6 0.7403349 0.5395738 0.5280061 0.5013699 0.4686454  # optimal keepX achieving lowest error rate head(splsda.tune$choice.keepX)

## comp1 comp2 comp3 comp4 comp5
##   100     8     1    50    40

plot(splsda.tune, optimal = TRUE, sd = TRUE)


The graphic above shows that the error rate decreases when more components are included in sPLS-DA.

sPLS-DA

We now run a sPLS-DA multilevel analysis. Note: with sPLS-DA we can only choose a CLR transformation.

select.keepX = splsda.tune$choice.keepX #from tuning set above select.ncomp = length(select.keepX) res.splsda = splsda(X = data.TSS, Y = Y, multilevel = sample, ncomp = 5, keepX = select.keepX, logratio= "CLR")  #on the first two components plotIndiv(res.splsda, comp =c(1,2), ind.names = FALSE, col.per.group = color.mixo(1:9), # the number of groups pch = 16, ellipse = TRUE, legend = TRUE, title = 'ORAL, sPLSDA comp 1 - 2')  Evaluating sPLS-DA The classification performance of the sPLS-DA multilevel model can be assessed using the function perf(). The mean error rates per component are output and type of distance are output. The prediction distance can also be specified, see ?perf. #set seed set.seed(34) splsda.perf = perf(res.splsda, validation = 'Mfold', folds = 5, progressBar = FALSE, nrepeat = 10)  load('RData/ORAL-perf-sPLSDA.RData')  kable(head(splsda.perf$error.rate.class))

comp 1 comp 2 comp 3 comp 4 comp 5
Attached_Keratinized_gingiva 0.6438356 0.4657534 0.3698630 0.3561644 0.3424658
Buccal_mucosa 0.5205479 0.4246575 0.2602740 0.2602740 0.2328767
Hard_palate 0.7260274 0.3013699 0.2739726 0.2602740 0.2465753
Palatine_Tonsils 0.9041096 0.8767123 0.8630137 0.7123288 0.6712329
Saliva 0.5342466 0.2328767 0.3424658 0.3561644 0.3287671
Subgingival_plaque 0.4520548 0.5068493 0.5205479 0.5068493 0.4520548
Supragingival_plaque 0.3972603 0.4109589 0.4520548 0.4520548 0.2602740
Throat 0.8630137 0.8082192 0.7808219 0.6301370 0.5753425
Tongue_dorsum 0.3698630 0.2739726 0.3561644 0.2876712 0.2054795
plot(splsda.perf)


head(selectVar(res.splsda, comp = 1)\$value)

##              value.var
## OTU_97.33396 0.2275421
## OTU_97.1893  0.2076086
## OTU_97.108   0.2050321
## OTU_97.39    0.2005539
## OTU_97.46    0.1994383
## OTU_97.23410 0.1963586


Contribution plots

The sPLS-DA selects the most discriminative OTUs that best characterize each body site indicated in Y. The contribution plots below displays the importance of each OTU in the sPLS-DA model and in which body site they are the most abundant (contrib = 'max'), according to the median (method = 'median'). Other options are available, see ?plotLoadings

#for component 1

#for component 1