HMP Oral Data 16S
Here we apply mixMC to the Human Microbiome Project (HMP) Oral 16S data set. In this data set we consider samples from the oral cavity, which has been found to be as diverse as the stool microbiome. The nine oral sites are Attached Keratinising Gingiva, Buccal Mucosa, Hard Palate, Palatine Tonsils, Saliva, Subgingival Plaque, Supragingival Plaque, Throat and Tongue Dorsum. After prefiltering, the data include 1,562 OTUs for 73 unique patients and a total of 657 samples.
Launch mixOmics, which can be downloaded from our website.
Don't forget to update to the latest mixOmics version.
Data: Oral Data Set
The Oral data set used in this example can be downloaded from the mixMC page.
Filtering and Normalisation
In this example we assume the Oral 16S data have already been prefiltered and normalised with Total Sum Scaling (TSS); see Filtering and Normalisation for more details.
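As a minimal illustration of what TSS does (a sketch on a toy count matrix, not the tutorial's data): each sample is divided by its total library size, turning raw counts into relative abundances that are comparable across samples.

```r
# TSS sketch on a toy count matrix: rows = samples, columns = OTUs
counts <- matrix(c(10, 0, 5,
                    2, 8, 0,
                    4, 4, 4),
                 nrow = 3, byrow = TRUE)
# divide each sample (row) by its library size
tss <- counts / rowSums(counts)
rowSums(tss)  # each row now sums to 1
```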
Set up the outcome “Y” as a factor indicating the body site of each sample, and “sample”, which indicates the unique ID of each repeated individual (repeated measures design).
# the Y outcome is a factor indicating the body site of each sample,
# while sample is a vector indicating the unique ID of each repeated individual
Y = as.factor(indiv$HMPbodysubsite)
summary(Y)
## Attached_Keratinized_gingiva                Buccal_mucosa
##                           73                           73
##                  Hard_palate             Palatine_Tonsils
##                           73                           73
##                       Saliva           Subgingival_plaque
##                           73                           73
##         Supragingival_plaque                       Throat
##                           73                           73
##                Tongue_dorsum
##                           73
sample = as.factor(indiv$RSID)
Unsupervised Analysis with PCA
PCA enables visualisation of diversity patterns in an unsupervised analysis. We need to specify the log-ratio transformation (choose between ‘CLR’ and ‘ILR’) and a sufficient number of components to understand how the variation in the data is explained per component.
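For intuition, the centered log-ratio (CLR) transform can be sketched in base R on a single toy composition. The small offset below is purely illustrative, to avoid log(0) for zero counts; mixOmics handles this internally.

```r
# CLR sketch: log each proportion, then centre by the mean of the logs
x <- c(0.50, 0.25, 0.15, 0.10)   # one toy TSS-normalised sample
offset <- 1e-6                   # illustrative offset avoiding log(0)
clr <- log(x + offset) - mean(log(x + offset))
sum(clr)  # CLR values sum to (numerically) zero
```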
pca.res = pca(X = data.TSS, ncomp = 10, logratio = 'ILR', multilevel = sample)
plot(pca.res)
The barplot above depicts the percentage of explained variance per component for the PCA. It is a good indicator of how many components to retain in the final PCA model. The sample plot below displays the samples coloured according to Y; note, however, that PCA does not take Y into account in the analysis!
# plotting the first two components
plotIndiv(pca.res, comp = c(1, 2),
          pch = 16, ind.names = FALSE, group = Y,
          col.per.group = color.mixo(1:9), legend = TRUE)
Supervised Analysis and Selection of Discriminative OTUs with sPLS-DA
sPLS-DA performs variable (OTU) selection on each component; see sPLS-DA for details.
To choose the number of components for sPLS-DA, we run the function perf() on a PLS-DA model with no variable selection (often ncomp = K-1, where K is the number of categories in the outcome Y, is sufficient, but this depends on the data).
# plsda
data.plsda <- plsda(X = data.TSS, Y, ncomp = nlevels(Y) - 1)
# check performance
perf.plsda <- perf(data.plsda, validation = 'Mfold', folds = 5,
                   progressBar = FALSE, nrepeat = 10)
plot(perf.plsda, overlay = 'measure', sd = TRUE)
Above is the plot of the classification error rate, averaged across the 5 folds and the 10 CV repeats, for all prediction distances. BER stands for balanced error rate, which accounts for an unbalanced number of samples per class. This step allows us to choose the best prediction distance to input in the sPLS-DA tuning step below.
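As a sketch of how BER differs from the overall error rate, consider a toy confusion matrix with hypothetical counts: BER averages the per-class error rates, so each class weighs equally regardless of its size.

```r
# BER sketch: rows = true class, columns = predicted class (toy counts)
conf <- matrix(c(90, 10,
                 30, 70),
               nrow = 2, byrow = TRUE)
per.class.error <- 1 - diag(conf) / rowSums(conf)  # error rate per class
ber <- mean(per.class.error)  # average gives each class equal weight
ber  # 0.2 here: the mean of 0.1 and 0.3
```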
plotIndiv(data.plsda, comp = c(1, 2),
          group = Y, ind.names = FALSE, ellipse = TRUE, legend = TRUE,
          title = 'Oral-16S, PLSDA comp 1 - 2')
We can observe better clusters than previously with the PCA analysis. This is to be expected since the PLS-DA model includes the class information of each individual. The sPLS-DA analysis will help refine the sample clusters and select a small subset of variables relevant to discriminate each class.
The tuning of sPLS-DA is performed one component at a time inside the function, and the optimal number of variables to select is automatically retrieved for each component. We set ncomp = 5 and use 5-fold cross-validation (folds = 5) repeated 10 times. To ensure a stable result we advise setting nrepeat to 50-100.
# set the tuning grid for some possible values of keepX
list.keepX <- c(5:10, seq(15, 50, 5), seq(60, 100, 10))
ncomp = 5

splsda.tune = tune.splsda(data.TSS, Y, ncomp = 5,
                          test.keepX = list.keepX,
                          multilevel = sample,
                          logratio = 'CLR',
                          validation = 'Mfold', folds = 5,
                          dist = "centroids.dist",
                          nrepeat = 10)  # recommended between 50-100
# may show some convergence issues for some of the cases; this is ok for tuning
# mean error rate across all CV folds and nrepeats
head(splsda.tune$error.rate)
##       comp1     comp2     comp3     comp4     comp5
## 1 0.7745814 0.5589041 0.5185693 0.5030441 0.4841705
## 2 0.7646880 0.5414003 0.5210046 0.5025875 0.4799087
## 3 0.7579909 0.5418569 0.5234399 0.5044140 0.4768645
## 4 0.7534247 0.5423135 0.5249619 0.5009132 0.4742770
## 5 0.7462709 0.5414003 0.5267884 0.5007610 0.4722983
## 6 0.7403349 0.5395738 0.5280061 0.5013699 0.4686454
# optimal keepX achieving the lowest error rate
head(splsda.tune$choice.keepX)
## comp1 comp2 comp3 comp4 comp5
##   100     8     1    50    40
plot(splsda.tune, optimal = TRUE, sd = TRUE)
The graphic above shows that the error rate decreases when more components are included in sPLS-DA.
We now run the sPLS-DA multilevel analysis. Note: with sPLS-DA, only the CLR transformation is available.
select.keepX = splsda.tune$choice.keepX  # from the tuning step above
select.ncomp = length(select.keepX)

res.splsda = splsda(X = data.TSS, Y = Y,
                    multilevel = sample,
                    ncomp = 5, keepX = select.keepX,
                    logratio = "CLR")
# on the first two components
plotIndiv(res.splsda, comp = c(1, 2),
          ind.names = FALSE,
          col.per.group = color.mixo(1:9),  # the number of groups
          pch = 16, ellipse = TRUE, legend = TRUE,
          title = 'ORAL, sPLSDA comp 1 - 2')
The classification performance of the multilevel sPLS-DA model can be assessed using the function perf(). The mean error rates per component and per type of prediction distance are output. The prediction distance can also be specified; see ?perf.
# set seed for reproducibility
set.seed(34)
splsda.perf = perf(res.splsda, validation = 'Mfold', folds = 5,
                   progressBar = FALSE, nrepeat = 10)
head(selectVar(res.splsda, comp = 1)$value)
##              value.var
## OTU_97.33396 0.2275421
## OTU_97.1893  0.2076086
## OTU_97.108   0.2050321
## OTU_97.39    0.2005539
## OTU_97.46    0.1994383
## OTU_97.23410 0.1963586
sPLS-DA selects the most discriminative OTUs that best characterise each body site indicated in Y. The contribution plot below displays the importance of each OTU in the sPLS-DA model and the body site in which it is most abundant (contrib = 'max'), according to the mean abundance per site (method = 'mean'). Other options are available; see ?plotLoadings.
# for component 1
plotLoadings(res.splsda, comp = 1, method = 'mean', contrib = 'max')
Clustered Image Map
We represent clustered image maps (with Euclidean distance and Ward linkage set by default) for the OTUs selected on each sPLS-DA component. The abundance values displayed are the normalised, log-ratio transformed values. All OTUs selected by the sPLS-DA model are displayed; other options include restricting the plot to a specific component, see ?cim.
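The clustering underlying cim() can be sketched with base R on a toy matrix: Euclidean distances between rows, then Ward linkage (ward.D2 in hclust). This is a sketch of the default distance and linkage only, not of cim() itself.

```r
# Euclidean distance + Ward linkage on a toy numeric matrix
set.seed(1)
mat <- matrix(rnorm(40), nrow = 8)  # 8 toy samples, 5 variables
hc <- hclust(dist(mat, method = "euclidean"), method = "ward.D2")
plot(hc)  # dendrogram ordering the 8 toy samples
```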
# for component 1
cim(res.splsda, comp = 1, row.sideColors = color.mixo(Y))